APAP I/O programmable router



A parallel array processor for massively parallel applications is formed with low power CMOS with DRAM
processing while incorporating processing elements on a single chip. Eight processors on a single chip have
their own associated processing element, significant memory, and I/O and are interconnected with a hypercube
based, but modified, topology. These nodes are then interconnected, either by a hypercube, modified
hypercube, or ring, or ring within ring network topology. Conventional microprocessor MMPs consume pins and
time going to memory. The new architecture merges processor and memory with multiple PMEs (eight 16 bit
processors with 32K and I/O) in DRAM and has no memory access delays and uses all the pins for networking.
The chip can be a single node of a fine-grained parallel processor. Each chip will have eight 16 bit
processors, each processor providing 5 MIPS performance. I/O has three internal ports and one external port
shared by the plural processors on the chip. Significant software flexibility is provided to enable quick
implementation of existing programs written in common languages. It is a developable and expandable
technology without need to develop new pinouts, new software, or new utilities as chip density increases and
new hardware is provided for a chip function. The scalable chip PME has internal and external connections for
broadcast and asynchronous SIMD, MIMD and SIMIMD (SIMD/MIMD) with dynamic switching of modes. The chip can be
used in systems which employ 32, 64 or 128,000 processors, and can be used for lower, intermediate and higher
ranges.

FIELD OF THE INVENTIONS

The invention relates to computers and computer systems and particularly to parallel array processors.
In accordance with the invention, a parallel array processor (APAP) may be incorporated on a single
semiconductor silicon chip. This chip forms a basis for the systems described which are capable of
massively parallel processing of complex scientific and business applications.

REFERENCES USED IN THE DISCUSSION OF THE INVENTIONS 

In the detailed discussion of the invention, other works will be referenced, including references to our
own unpublished works which are not Prior Art, which will aid the reader in following the discussion.

GLOSSARY OF TERMS 
o ALU 

ALU is the arithmetic logic unit portion of a processor.
o Array 

Array refers to an arrangement of elements in one or more dimensions. An array can include an
ordered set of data items (array element) which in languages like Fortran are identified by a single
name. In other languages such a name of an ordered set of data items refers to an ordered collection
or set of data elements, all of which have identical attributes. A program array has dimensions
specified, generally by a number or dimension attribute. The declarator of the array may also specify
the size of each dimension of the array in some languages. In some languages, an array is an
arrangement of elements in a table. In a hardware sense, an array is a collection of structures
(functional elements) which are generally identical in a massively parallel architecture. Array elements
in data parallel computing are elements which can be assigned operations and when parallel can each
independently and in parallel execute the operations required. Generally, arrays may be thought of as
grids of processing elements. Sections of the array may be assigned sectional data, so that sectional
data can be moved around in a regular grid pattern. However, data can be indexed or assigned to an
arbitrary location in an array.
o Array Director

An Array Director is a unit programmed as a controller for an array. It performs the function of a
master controller for a grouping of functional elements arranged in an array.
o Array Processor

There are two principal types of array processors: multiple instruction multiple data (MIMD) and single
instruction multiple data (SIMD). In a MIMD array processor, each processing element in the array
executes its own unique instruction stream with its own data. In a SIMD array processor, each
processing element in the array is restricted to the same instruction via a common instruction stream;
however, the data associated with each processing element is unique. Our preferred array processor
has other characteristics. We call it Advanced Parallel Array Processor, and use the acronym APAP.
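The difference between the two array processor types can be illustrated with a minimal sketch. The function names and sample data are hypothetical and chosen only for illustration; real processing elements operate concurrently, whereas this sketch visits them one after another.

```python
from typing import Callable, List


def simd_step(instruction: Callable[[int], int], data: List[int]) -> List[int]:
    """SIMD: one common instruction is applied to each element's own data."""
    return [instruction(x) for x in data]


def mimd_step(programs: List[Callable[[int], int]], data: List[int]) -> List[int]:
    """MIMD: each element executes its own instruction on its own data."""
    if len(programs) != len(data):
        raise ValueError("one program per processing element is required")
    return [program(x) for program, x in zip(programs, data)]


# Four hypothetical processing elements, each holding one data item.
data = [1, 2, 3, 4]

# SIMD: every element adds 10 to its own data item.
simd_result = simd_step(lambda x: x + 10, data)

# MIMD: every element runs a different operation.
mimd_result = mimd_step(
    [lambda x: x + 10, lambda x: x * 2, lambda x: -x, lambda x: x * x],
    data,
)
```

In both cases the data is unique per element; only the source of the instruction differs.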
o Asynchronous 

Asynchronous is without a regular time relationship; the execution of a function is unpredictable with
respect to the execution of other functions which occur without a regular or predictable time
relationship to other function executions. In control situations, a controller will address a location to
which control is passed when data is waiting for an idle element being addressed. This permits
operations to remain in a sequence while they are out of time coincidence with any event.
o BOPS/GOPS

BOPS or GOPS are acronyms having the same meaning: billions of operations per second. See
GOPS.

o Circuit Switched/Store Forward 

These terms refer to two mechanisms for moving data packets through a network of nodes. Store 
Forward is a mechanism whereby a data packet is received by each intermediate node, stored into its 
memory, and then forwarded on towards its destination. Circuit Switch is a mechanism whereby an
intermediate node is commanded to logically connect its input port to an output port such that data
packets can pass directly through the node towards their destination, without entering the intermediate
node's memory.
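The two mechanisms can be contrasted with a small sketch. The `Node` class, the function names, and the four-node path are hypothetical and serve only to show which nodes' memories a packet enters; link timing and port control are not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    memory: List[str] = field(default_factory=list)  # packets currently held
    stored_count: int = 0  # packets that ever entered this node's memory


def send_store_forward(path: List[Node], packet: str) -> None:
    """Each intermediate node receives, stores, then forwards the packet."""
    for node in path[1:-1]:
        node.memory.append(packet)   # stored into the intermediate node's memory
        node.stored_count += 1
        node.memory.remove(packet)   # forwarded on towards the destination
    path[-1].memory.append(packet)
    path[-1].stored_count += 1


def send_circuit_switched(path: List[Node], packet: str) -> None:
    """Intermediate nodes only connect an input port to an output port."""
    # The packet passes through path[1:-1] without entering their memories.
    path[-1].memory.append(packet)
    path[-1].stored_count += 1


sf_path = [Node(n) for n in "ABCD"]
cs_path = [Node(n) for n in "ABCD"]
send_store_forward(sf_path, "pkt")
send_circuit_switched(cs_path, "pkt")
sf_counts = [n.stored_count for n in sf_path]
cs_counts = [n.stored_count for n in cs_path]
```

Comparing `sf_counts` with `cs_counts` shows that only the store-forward path touches the memories of the intermediate nodes B and C.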

o Cluster 

A cluster is a station (or functional unit) which consists of a control unit (cluster controller) and the
hardware (which may be terminals, functional units, or virtual components) attached to it. Our Cluster
includes an array of PMEs sometimes called a Node array. Usually a cluster has 512 PMEs.

Our Entire PME node array consists of a set of clusters, each cluster supported by a cluster
controller (CC).
o Cluster controller 

A cluster controller is a device that controls input/output (I/O) operations for more than one device or
functional unit connected to it. A cluster controller is usually controlled by a program stored and
executed in the unit as it was in the IBM 3601 Finance Communication Controller, but it can be
entirely controlled by hardware as it was in the IBM 3272 Control Unit.
o Cluster synchronizer 

A cluster synchronizer is a functional unit which manages the operations of all or part of a cluster to
maintain synchronous operation of the elements so that the functional units maintain a particular time
relationship with the execution of a program.
o Controller 

A controller is a device that directs the transmission of data and instructions over the links of an
interconnection network; its operation is controlled by a program executed by a processor to which
the controller is connected or by a program executed within the device.
o CMOS 

CMOS is an acronym for Complementary Metal-Oxide Semiconductor technology. It is commonly
used to manufacture dynamic random access memories (DRAMs). NMOS is another technology used
to manufacture DRAMs. We prefer CMOS but the technology used to manufacture the APAP is not
intended to limit the scope of the semiconductor technology which is employed.
o Dotting 

Dotting refers to the joining of three or more leads by physically connecting them together. Most
backpanel busses share this connection approach. The term relates to OR DOTS of times past but is
used here to identify multiple data sources that can be combined onto a bus by a very simple
protocol.

Our I/O zipper concept can be used to implement the concept that the port into a node could be
driven by the port out of a node or by data coming from the system bus. Conversely, data being put
out of a node would be available to both the input to another node and to the system bus. Note that
outputting data to both the system bus and another node is not done simultaneously but in different
cycles.

Dotting is used in the H-DOT discussions where Two-ported PEs or PMEs or Pickets can be used
in arrays of various organizations by taking advantage of dotting. Several topologies are discussed
including 2D and 3D Meshes, Base 2 N-cube, Sparse Base 4 N-cube, and Sparse Base 8 N-cube.
o DRAM

DRAM is an acronym for dynamic random access memory, the common storage used by computers
for main memory. However, the term DRAM can be applied to use as a cache or as a memory which
is not the main memory.
o FLOATING-POINT

A floating-point number is expressed in two parts. There is a fixed point or fraction part, and an
exponent part to some assumed radix or Base. The exponent indicates the actual placement of the
decimal point. In the typical floating-point representation a real number 0.0001234 is represented as
0.1234-3, where 0.1234 is the fixed-point part and -3 is the exponent. In this example, the floating-point
radix or base is 10, where 10 represents the implicit fixed positive integer base, greater than
unity, that is raised to the power explicitly denoted by the exponent in the floating-point representation
or represented by the characteristic in the floating-point representation and then multiplied by the
fixed-point part to determine the real number represented. Numeric literals can be expressed in
floating-point notation as well as real numbers.
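The example in this definition can be checked with a short sketch. The function name `from_float_parts` is hypothetical; Python's standard `decimal` module is used so that the base-10 arithmetic is exact rather than subject to binary rounding.

```python
from decimal import Decimal


def from_float_parts(fraction: Decimal, exponent: int, radix: int = 10) -> Decimal:
    """Multiply the fixed-point part by the radix raised to the exponent."""
    return fraction * (Decimal(radix) ** exponent)


# Fixed-point part 0.1234 with exponent -3 in base 10, as in the text.
value = from_float_parts(Decimal("0.1234"), -3)
```

Here `value` equals `Decimal("0.0001234")`, the real number given in the definition.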
o FLOPS

This term refers to floating-point instructions per second. Floating-point operations include ADD,
SUB, MPY, DIV and often many others. The floating-point instructions per second parameter is often
calculated using the add or multiply instructions and, in general, may be considered to have a 50/50
mix. An operation includes the generation of exponent, fraction and any required fraction normalization.
We could address 32 or 48-bit floating-point formats (or longer, but we have not counted them in
the mix). A floating-point operation when implemented with fixed point instructions (normal or RISC)
requires multiple instructions. Some use a 10 to 1 ratio in figuring performance while some specific
studies have shown a ratio of 6.25 more appropriate to use. Various architectures will have different
ratios.
o Functional unit

A functional unit is an entity of hardware, software, or both, capable of accomplishing a purpose.
o Gbytes

Gbytes refers to a billion bytes. Gbytes/s would be a billion bytes per second.
o GIGAFLOPS 

(10)**9 floating-point instructions per second.
o GOPS and PETAOPS

GOPS or BOPS have the same meaning: billions of operations per second. PETAOPS means
trillions of operations per second, a potential of the current machine. For our APAP machine they are
just about the same as BIPs/GIPs, meaning billions of instructions per second. In some machines an
instruction can cause two or more operations (i.e. both an add and multiply) but we don't do that.
Alternatively it could take many instructions to do an op. For example we use multiple instructions to
perform 64 bit arithmetic. In counting ops however, we did not elect to count log ops. GOPS may be
the preferred use to describe performance, but there is no consistency in usage that has been noted.
One sees MIPs/MOPs then BIPs/BOPs and MegaFLOPS/GigaFLOPS/TeraFLOPS/PetaFLOPS.
o ISA

ISA means the Instruction Set Architecture.
o Link

A link is an element which may be physical or logical. A physical link is the physical connection for
joining elements or units, while in computer programming a link is an instruction or address that
passes control and parameters between separate portions of the program. In multisystems a link is
the connection between two systems which may be specified by program code identifying the link
which may be identified by a real or virtual address. Thus generally a link includes the physical
medium, any protocol, and associated devices and programming; it is both logical and physical.

o MFLOPS 

MFLOPS means (10)**6 floating-point instructions per second.
o MIMD 

MIMD is used to refer to a processor array architecture wherein each processor in the array has its
own instruction stream, thus Multiple Instruction stream, to execute Multiple Data streams located one
per processing element.
o Module 

A module is a program unit that is discrete and identifiable or a functional unit of hardware designed
for use with other components. Also, a collection of PEs contained in a single electronic chip is called
a module.
o Node

Generally, a node is the junction of links. In a generic array of PEs, one PE can be a node. A node
can also contain a collection of PEs called a module. In accordance with our invention a node is
formed of an array of PMEs, and we refer to the set of PMEs as a node. Preferably a node is 8 PMEs.
o Node array

A collection of modules made up of PMEs is sometimes referred to as a node array; it is an array of
nodes made up of modules. A node array is usually more than a few PMEs, but the term
encompasses a plurality.
o PDE

A PDE is a partial differential equation.

o PDE relaxation solution process

PDE relaxation solution process is a way to solve a PDE (partial differential equation). Solving PDEs
uses most of the super computing compute power in the known universe and can therefore be a good
example of the relaxation process. There are many ways to solve the PDE equation and more than
one of the numerical methods includes the relaxation process. For example, if a PDE is solved by
finite element methods relaxation consumes the bulk of the computing time. Consider an example
from the world of heat transfer. Given hot gas inside a chimney and a cold wind outside, how will the
temperature gradient within the chimney bricks develop? By considering the bricks as tiny segments
and writing an equation that says how heat flows between segments as a function of temperature
differences then the heat transfer PDE has been converted into a finite element problem. If we then
say all elements except those on the inside and outside are at room temperature while the boundary
segments are at the hot gas and cold wind temperature, we have set up the problem to begin
relaxation. The computer program then models time by updating the temperature variable in each
segment based upon the amount of heat that flows into or out of the segment. It takes many cycles of
processing all the segments in the model before the set of temperature variables across the chimney
relaxes to represent actual temperature distribution that would occur in the physical chimney. If the
objective was to model gas cooling in the chimney then the elements would have to extend to gas
equations, and the boundary conditions on the inside would be linked to another finite element model,
and the process continues. Note that the heat flow is dependent upon the temperature difference
between the segment and its neighbors. It thus uses the inter-PE communication paths to distribute
the temperature variables. It is this near neighbor communication pattern or characteristic that makes
PDE relaxation very applicable to parallel computing.
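The chimney example can be sketched as a one-dimensional relaxation. This is a minimal illustration under stated assumptions, not the method of the invention: the temperatures (400 for the hot gas, 0 for the cold wind, 20 for room temperature), the ten-segment wall, and the simple neighbor-averaging update (a Jacobi step) are all hypothetical choices.

```python
from typing import List


def relax(temps: List[float], sweeps: int) -> List[float]:
    """Repeatedly replace each interior segment by the mean of its two neighbors.

    The first and last entries are boundary segments and are held fixed.
    Each update reads only the nearest neighbors, which is the communication
    pattern that maps well onto an array of processing elements.
    """
    temps = list(temps)
    for _ in range(sweeps):
        new = temps[:]  # all segments are updated from the previous sweep's values
        for i in range(1, len(temps) - 1):
            new[i] = 0.5 * (temps[i - 1] + temps[i + 1])
        temps = new
    return temps


# Hypothetical wall: hot gas side, eight brick segments at room temperature, cold wind side.
wall = [400.0] + [20.0] * 8 + [0.0]
result = relax(wall, 2000)
```

After many sweeps the interior values settle onto a straight-line gradient between the two fixed boundary temperatures, which is the steady state for this simplified model.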

o PICKET

This is the element in an array of elements making up an array processor. It consists of data flow
(ALU REGS), memory, control, and the portion of the communication matrix associated with the
element. The unit refers to a 1/nth of an array processor made up of parallel processor and memory
elements with their control and portion of the array intercommunication mechanism. A picket is a form
of processor memory element or PME. Our PME chip design processor logic can implement the
picket logic described in related applications or have the logic for the array of processors formed as a
node. The term PICKET is similar to the commonly used array term PE for processing element, and
is an element of the processing array preferably comprised of a combined processing element and
local memory for processing bit parallel bytes of information in a clock cycle. The preferred
embodiment consists of a byte wide data flow processor, 32k bytes or more of memory, primitive
controls and ties to communications with other pickets.

The term "picket" comes from Tom Sawyer and his white fence, although it will also be
understood functionally that a military picket line analogy fits quite well.
o Picket Chip 

A picket chip contains a plurality of pickets on a single silicon chip.
o Picket Processor system (or Subsystem)

A picket processor is a total system consisting of an array of pickets, a communication network, an
I/O system, and a SIMD controller consisting of a microprocessor, a canned routine processor, and a
micro-controller that runs the array.
o Picket Architecture

The Picket Architecture is the preferred embodiment for the SIMD architecture with features that
accommodate several diverse kinds of problems including:

- set associative processing

- parallel numerically intensive processing

- physical array processing similar to images

o Picket Array

A picket array is a collection of pickets arranged in a geometric order, a regular array.
o PME or processor memory element 

PME is used for a processor memory element. We use the term PME to refer to a single processor,
memory and I/O capable system element or unit that forms one of our parallel array processors. A
processor memory element is a term which encompasses a picket. A processor memory element is
1/nth of a processor array which comprises a processor, its associated memory, control interface, and
a portion of an array communication network mechanism. This element can have a processor memory
element with a connectivity of a regular array, as in a picket processor, or as part of a subarray, as in
the multi-processor memory element node we have described.
o Routing 

Routing is the assignment of a physical path by which a message will reach its destination. Routing
assignments have a source or origin and a destination. These elements or addresses have a
temporary relationship or affinity. Often, message routing is based upon a key which is obtained by
reference to a table of assignments. In a network, a destination is any station or network addressable
unit addressed as the destination of information transmitted by a path control address that identifies
the link. The destination field identifies the destination with a message header destination code.
o SIMD

A processor array architecture wherein all processors in the array are commanded from a Single
Instruction stream to execute Multiple Data streams located one per processing element.

o SIMDMIMD or SIMD/MIMD

SIMDMIMD or SIMD/MIMD is a term referring to a machine that has a dual function that can switch
from MIMD to SIMD for a period of time to handle some complex instruction, and thus has two
modes. The Thinking Machines, Inc. Connection Machine model CM-2 when placed as a front end or
back end of a MIMD machine permitted programmers to operate different modes for execution of
different parts of a problem, referred to sometimes as dual modes. These machines have existed since
ILLIAC and have employed a bus that interconnects the master CPU with other processors. The master
control processor would have the capability of interrupting the processing of other CPUs. The other
CPUs could run independent program code. During an interruption, some provision must be made for
checkpointing (closing and saving current status of the controlled processors).

o SIMIMD

SIMIMD is a processor array architecture wherein all processors in the array are commanded from a
Single Instruction stream, to execute Multiple Data streams located one per processing element.
Within this construct, data dependent operations within each picket that mimic instruction execution
are controlled by the SIMD instruction stream.

This is a Single Instruction Stream machine with the ability to sequence Multiple Instruction
streams (one per Picket) using the SIMD instruction stream and operate on Multiple Data Streams
(one per Picket). SIMIMD can be executed by a processor memory element system.

o SISD

SISD is an acronym for Single Instruction Single Data.
o Swapping

Swapping interchanges the data content of a storage area with that of another area of storage.

o Synchronous Operation

Synchronous operation in a MIMD machine is a mode of operation in which each action is related to
an event (usually a clock); it can be a specified event that occurs regularly in a program sequence. An
operation is dispatched to a number of PEs who then go off to independently perform the function.
Control is not returned to the controller until the operation is completed. If the request is to an array of
functional units, the request is generated by a controller to elements in the array which must complete
their operation before control is returned to the controller.

o TERAFLOPS

TERAFLOPS means (10)**12 floating-point instructions per second.

o VLSI 

VLSI is an acronym for very large scale integration (as applied to integrated circuits).
o Zipper 

A zipper is a new function provided. It allows for links to be made from devices which are external to
the normal interconnection of an array configuration.

BACKGROUND OF THE INVENTION 

In the never ending quest for faster computers, engineers are linking hundreds, and even thousands of
low cost microprocessors together in parallel to create super supercomputers that divide in order to conquer
complex problems that stump today's machines. Such machines are called massively parallel. We have
created a new way to create massively parallel systems. The many improvements which we have made
should be considered against the background of many works of others.

Multiple computers operating in parallel have existed for decades. Early parallel machines included the
ILLIAC, which was started in the 1960s. ILLIAC IV was built in the 1970s. Other multiple processors include
(see a partial summary in U.S. Patent 4,975,834 issued December 4, 1990 to Xu et al.) the Cedar, Sigma-1,
the Butterfly and the Monarch, the Intel iPSC, the Connection Machines, the Caltech COSMIC, the N Cube,
IBM's RP3, IBM's GF11, the NYU Ultra Computer, the Intel Delta and Touchstone.

Large multiple processors beginning with ILLIAC have been considered supercomputers. Supercomputers
with greatest commercial success have been based upon multiple vector processors, represented by
the Cray Research Y-MP systems, the IBM 3090, and other manufacturers' machines including those of
Amdahl, Hitachi, Fujitsu, and NEC.

Massively Parallel Processors (MPPs) are now thought of as capable of becoming supercomputers.
These computer systems aggregate a large number of microprocessors with an interconnection network
and program them to operate in parallel. There have been two modes of operation of these computers.
Some of these machines have been MIMD mode machines.

Some of these machines have been SIMD mode machines. Perhaps the most commercially acclaimed
of these machines has been the Connection Machines series 1 and 2 of Thinking Machines, Inc. These
have been essentially SIMD machines. Many of the massively parallel machines have used microprocessors
interconnected in parallel to obtain their concurrency or parallel operations capability. Intel microprocessors
like the i860 have been used by Intel and others. N Cube has made such machines with Intel '386
microprocessors. Other machines have been built with what is called the "transputer" chip. Inmos
Transputer IMS T800 is an example. The Inmos Transputer T800 is a 32 bit device with an integral high
speed floating point processor.

As an example of the kind of systems that are built, several Inmos Transputer T800 chips each would
have 32 communication link inputs and 32 link outputs. Each chip would have a single processor, a small
amount of memory, and communication links to the local memory and to an external interface. In addition,
in order to build up the system, communication link adaptors like IMS C011 and C012 would be connected.
In addition switches, like an IMS C004, would provide, say, a crossbar switch between the 32 link inputs and
32 link outputs to provide point-to-point connection between additional transputer chips. In addition, there
will be special circuitry and interface chips for transputers adapting them to be used for a special purpose
tailored to the requirements of a specific device, a graphics or disk controller. The Inmos IMS M212 is a 16
bit processor, with on chip memory and communication links. It contains hardware and logic to control disk
drives and can be used as a programmable disk controller or as a general purpose interface. In order to use
the concurrency (parallel operations) Inmos developed a special language, Occam, for the transputer.
Programmers have to describe the network of transputers directly in an Occam program.

Some of these massively parallel machines use parallel processor arrays of processor chips which are
interconnected with different topologies. The transputer provides a crossbar network with the addition of IMS
C004 chips. Some other systems use a hypercube connection. Others use a bus or mesh to connect the
microprocessors and their associated circuitry. Some have been interconnected by circuit switch processors
that use switches as processor addressable networks. Generally, as with the 14 RISC/6000s which
were interconnected last fall at Lawrence Livermore by wiring the machines together, the processor
addressable networks have been considered as coarse-grained multiprocessors.

Some very large machines are being built by Intel and nCube and others to attack what are called
"grand challenges" in data processing. However, these computers are very expensive. Recent projected
costs are in the order of $30,000,000.00 to $75,000,000.00 (Tera Computer) for computers whose
development has been funded by the U.S. Government to attack the "grand challenges". These "grand
challenges" would include such problems as climate modeling, fluid turbulence, pollution dispersion,
mapping of the human genome and ocean circulation, quantum chromodynamics, semiconductor and
supercomputer modeling, combustion systems, vision and cognition.

As a footnote to our background, we should recognize one of the early massively parallel machines
developed by IBM. In our description we have chosen to use the term processor memory element rather
than "transputer" to describe one of the eight or more memory units with processor and I/O capabilities
which make up the array of PMEs in a chip, or node. The referenced prior art "transputer" has on a chip
one processor, a Fortran coprocessor, and a small memory, with an I/O interface. Our processor memory
element could apply to a transputer and to the PME of the RP3 generally. However, as will be recognized,
our little chip is significantly different in many respects. Our little chip has many features described later.

However, we do recognize that the term PME was first coined for another, now more typical, PME which
formed the basis for the massively parallel machine known as the RP3. The IBM Research Parallel
Processing Prototype (RP3) was an experimental parallel processor based on a Multiple Instruction Multiple
Data (MIMD) architecture. RP3 was designed and built at IBM T.J. Watson Research Center in cooperation
with the New York University Ultracomputer project. This work was sponsored in part by Defense Advanced
Research Project Agency. RP3 was comprised of 64 Processor-Memory Elements (PMEs) interconnected
by a high speed omega network. Each PME contained a 32-bit IBM "PC scientific" microprocessor, 32-kB
cache, a 4-MB segment of the system memory, and an I/O port. The PME I/O port hardware and software
supported initialization, status acquisition, as well as memory and processor communication through shared
I/O Support Processors (ISPs). Each ISP supports eight processor-memory elements through the Extended
I/O adapters (ETIOs), independent of the system network. Each ISP interfaced to the IBM S/370 channel
and the IBM Token-Ring network as well as providing operator monitor service. Each extended I/O adapter
attached as a device to a PME ROMP Storage Channel (RSC) and provided programmable PME
control/status signal I/O via the ETIO channel. The ETIO channel is the 32-bit bus which interconnected the
ISP to the eight adapters. The ETIO channel relied on a custom interface protocol which was supported by
hardware on the ETIO adapter and software on the
Problems addressed by our APAP machine



The machine which we have called the Advanced Parallel Array Processor (APAP) is a fine-grained
parallel processor which we believe is needed to address issues of prior designs. As illustrated above, there
have been many fine-grained (and also coarse-grained) processors constructed from both point design and
off-the-shelf processors using dedicated and shared memory and any one of the many possible
interconnection schemes. To date these approaches have all encountered one or more design and performance
limitations. Each "solution" leads in a different direction. Each has its problems. Existing parallel machines
are difficult to program. Each is not generally adaptable to various sizes of machines compatible across a
range of applications. Each has its design limitations caused by physical design, interconnection and
architectural issues.

Physical Issues

Some approaches utilize a separate chip design for each of the various functions required in a
horizontal structure. These approaches suffer performance limitations due to chip crossing delays.

Other approaches integrate various functions together vertically into a single chip. These approaches
suffer performance limitations due to the physical limit on the number of logic gates which can be
integrated onto a producible chip.

Interconnection Issues 

Networks which interconnect the various processing functions are important to fine-grained parallel
processors. Processor designs with buses, meshes, and hypercubes have all been developed. Each of
these networks has inherent limitations as to processing capability. Buses limit both the number of
processors which can be physically interconnected and the network performance. Meshes lead to large
network diameters which limit network performance. Hypercubes require each node to have a large number
of interconnection ports; the number of processors which can be interconnected is limited by the physical
input/output pins at the node. Hypercubes are recognized as having some significant performance gains
over the prior bus and mesh structures.
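The trade-off between port count and network diameter can be made concrete with a short calculation. This is an illustrative sketch only: the function names and the 1024-node comparison are hypothetical, and the mesh is assumed to be a square two-dimensional grid without wraparound links.

```python
from typing import Tuple


def hypercube_stats(n_nodes: int) -> Tuple[int, int]:
    """Return (ports per node, diameter) for a binary hypercube of n_nodes."""
    if n_nodes < 1 or n_nodes & (n_nodes - 1):
        raise ValueError("a binary hypercube needs a power-of-two node count")
    dimension = n_nodes.bit_length() - 1  # log2 of the node count
    # Each node links to one neighbor per dimension; the farthest node
    # differs in every address bit, so the diameter equals the dimension.
    return dimension, dimension


def mesh_stats(side: int) -> Tuple[int, int]:
    """Return (max ports per node, diameter) for a side x side 2D mesh."""
    if side < 3:
        raise ValueError("this sketch assumes a mesh with interior nodes (side >= 3)")
    # Interior nodes have four neighbors; the longest shortest path runs
    # corner to corner, (side - 1) hops in each of the two directions.
    return 4, 2 * (side - 1)


# Compare 1024 nodes arranged as a 10-dimensional hypercube and as a 32 x 32 mesh.
cube_ports, cube_diameter = hypercube_stats(1024)
mesh_ports, mesh_diameter = mesh_stats(32)
```

For the same node count the hypercube needs more ports per node but has a much smaller diameter, which is the tension described in the paragraph above.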

Architectural Issues

Processes which are suitable for fine-grained parallel processors fall into two distinct types. Processes
which are functionally partitionable tend to perform better on multiple instruction, multiple data (MIMD)
architectures. Processes which are not functionally partitionable but have multiple data streams tend to
perform better on single instruction, multiple data (SIMD) architectures. For any given application, there is
likely to be some number of both types of processes. System trade-offs are required to pick the
architecture which best suits a particular application but no single solution has been satisfactory.

SUMMARY OF THE INVENTION 

We have created a new way to make massively parallel processors and other computer systems by
creating a new "chip" and systems designed with our new concepts. This application is directed to such
systems. Components described in our applications can be combined in our systems to make new
systems. They also can be combined with existing technology.

Our little CMOS DRAM chip of approximately 14 x 14 mm can be put together much like bricks 
are walled in a building or paved to form a brick road. Our chip provides the structure necessary to build a 
"house", a complex computer system, by connected replication. 

Placing our development in perspective, four little chips, each one alike, each one with eight or more 
processors embedded in memory with an internal array capability and external I/O broadcast and control 
interface, would provide the memory and processing power of thirty-six or more complex computers, and 
they could all be placed with compact hybrid packaging into something the size of a watch, and operated 
with very low power, as each chip only dissipates about 2 watts. With this chip, we have created many new 
concepts, and those that we consider our invention are described in detail in the description and claims. 
The systems that can be created with our computer system can range from small devices to massive 
machines with PETAOP potential. 

Our little memory chip array processor we call our Advanced Parallel Array Processor. Though small, it 
is complex, and powerful. A typical cluster will have many chips. 

Many aspects and features of invention have been described in this and related applications. These 
concepts and features of invention improve and are applicable to computer systems which may not employ 
each invention. We believe our concepts and features will be adopted and used in the next century. 

This technical description provides an overview of our Advanced Parallel Array Processor (APAP) 
representing our new memory concepts and our start in developing a scalable massively parallel processor 
(MPP) that is simple (very small number of unique part numbers) and has very high performance. Our 
processor utilizes in its preferred embodiment a VLSI chip. The chip comprises 2n PME microcomputers; 
n represents the maximum number of array dimensionality. The chip further comprises a broadcast and 
control interface (BCI) and internal and external communication paths between PMEs on the chip among 
themselves and to the off chip system environment. The preferred chip has 8 PMEs (but we also can 
provide more) and one BCI. The 2n PMEs and BCI are considered a node. This node can function in either 
SIMD or MIMD mode, in dual SIMD/MIMD mode, with asynchronous processing, and with SIMIMD functionality. 

Since it is scalable, this approach provides a node which can be the main building block for scalable 
parallel processors of varying size. The microcomputer architecture of the PME provides fully distributed 
message passing interconnection and control features within each node, or chip. Each node provides 
multiple parallel microcomputer capability at the chip level, the microprocessor or personal computer level, at 
a workstation level, at special application levels which may be represented by a vision and/or avionics level, 
and, when extended, to capability at greater levels with powerful Gigaflop performance into the 
supercomputer range. The simplicity is achieved by the use of a single highly extended DRAM chip that is 
replicated into parallel clusters. This keeps the part number count down and allows scaling capability to the 
cost or performance need, by varying the chip count, then the number of modules, etc. 

Our approach enables us to provide a machine with attributes meeting the requirements that drive to a 
parallel solution in a series of applications. Our methods of parallelization at the sub-chip level serve to keep 
weight, volume, and recurring and logistic costs down. 

Because our different size systems are all based upon a single chip, software tools are common for all 
size systems. This offers the potential of development software (running on smaller workstation machines) 
that is interchangeable among all levels (workstation, aerospace, and supercomputer). That advantage 
means programmers can develop programs on workstations while a production program runs on a much 
larger machine. 

As a result of our well balanced design implementation we meet today's requirements imposed by 
technology, performance, cost and perception, and enable growth of the system into the future. Since our 
MPP approach starts at the chip level, our discussion starts at the chip technology description and 
concludes with the supercomputer application description. 

Physical, interconnection, and architectural issues will all be addressed in the machine directly. 
Functions will not only be integrated into a single chip design, but the chip design will provide functions 
sufficiently powerful and flexible that the chip will be effective at processing, routing, storage and three 
types of I/O. The interconnection network will be a new version of the hypercube which provides minimum 
[...] limitations normally associated with hypercubes. [...] are eliminated because the design allows 
processors to dynamically switch between MIMD and SIMD mode. This eliminates many problems which will be 
encountered by application programmers of "hybrid" machines. In addition, the design will allow a subset of 
the processors to be in SIMD or MIMD mode. 

The Advanced Parallel Array Processor (APAP) is a fine-grained parallel processor. It consists of control 
and processing sections which are partitionable such that configurations suitable for supercomputing 
through personal computing applications can be satisfied. In most configurations it would attach to a host 
processor and support the off loading of segments of the host's workload. Because the APAP array 
processing elements are general purpose computers, the particular type of workload off-loaded will vary 
depending upon the capabilities of the host. For example, our APAP can be a module for an IBM 3090 
vector processor mainframe. When attached to a mainframe with high performance vector floating point 
capability the task off-loaded might be sparse to dense matrix transformations. Alternatively, when attached 
to a PC personal computer the off-loaded task might be numerically intensive 3 dimensional graphics 
processing. 

The above referenced parent USSN 07/611,594, filed November 13, 1990, of Dieffenderfer et al, titled 
"Parallel Associative Processor System" describes the idea of integrating computer memory and control 
logic within a single chip and replicating the combination within the chip and building a processor system 
out of replications of the single chip. This approach which is continued and expanded here leads to a 
system which provides massively parallel processing capability at the cost of developing and manufacturing 
only a single chip type while enhancing performance capability by reducing the chip boundary crossings 
and line length. 

The above referenced parent USSN 07/611,594, filed November 13, 1990 illustrated utilization of 
1-dimensional I/O structures (essentially a linear I/O) with multiple SIMD PMEs attached to that structure within 
a chip. This embodiment elaborates these concepts to dimensions greater than 1. The description which 
follows will be in terms of 4-dimensional I/O structures with 8 SIMD/MIMD PMEs per chip. However, that 
can be extended to greater dimensionality or more PMEs per dimension as we will describe with respect to 
FIGURES 3, 9, 10, 15 and 16. 

Our processing element includes a full I/O system including both data transfer and program interrupts. 
Our description of our preferred embodiment will be primarily described in terms of the preferred 
4-dimensional I/O structures with 8 SIMD/MIMD PMEs per chip, which has special advantages now in our 
view. However, that can be extended to greater dimensionality or more PMEs per dimension as described 
in our parent application. In addition, for most applications we prefer and have made inventions in areas of 
greater dimensions with hypercube interconnections, preferably with the modified hypercube we describe. 
However, in some applications a 2-dimensional mesh interconnection of chips will be applicable to a task at 
hand. For instance, in certain military computers a 2 dimensional mesh will be suitable and cost effective. 

This disclosure extends the concepts from the interprocessor communication to the external 
Input/Output facilities and describes the interfaces and modules required for control of the processing array. 

In summary three types of I/O, inter-processor, processors to/from external, and broadcast/control are 
described. Massively parallel processing systems require all these types of I/O bandwidth demands to be 
balanced with processor computing capability. Within the array these requirements will be satisfied by 
replicating a 16 bit (reduced) instruction set processor, augmented with very fast interrupt state swapping 
capability. That processor is referred to as the PME illustrating the preferred embodiment of our APAP. The 
characteristics of the PME are completely unique when compared with the processing elements on other 
massively parallel machines. It permits the processing, routing, storage and I/O to be completely distributed. 
This is not characteristic of any other design. 

In a hypercube each PME can address as its neighbor, any PME whose address differs in any single bit 
position. In a ring, any PME can address as its neighbor the two PMEs whose addresses differ by ±1. The 
modified hypercube of our preferred embodiment utilized for the APAP combines these approaches by 
building hypercubes out of rings. The intersection of rings is defined to be a node. Each node of our 
preferred system has its PME, memory and I/O, and other features of the node, formed in a semiconductor 
silicon low level CMOS DRAM chip. Nodes are constructed from multiple PMEs on each chip. Each PME 
exists in only one ring of nodes. PMEs within the node are connected by additional rings such that 
communications can be routed between rings within the node. This leads to the addressing structure where 
any PME can step messages toward the objective by addressing a PME in its own ring or an adjacent PME 
within the node. In essence a PME can address a PME whose address differs by 1 in the log2 d bit 
field of its ring (where d is the number of PMEs in the ring) or the PME with the same address but existing 
in an adjacent dimension. The PME effectively appears to exist in n sets of rings, while in actuality it exists 
only in one real ring and one hidden ring totally contained within the chip. The dimensionality for the 
modified hypercube is defined to be the value n from the previous sentence. 
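
The two neighbor rules stated at the start of this paragraph (one differing bit in a hypercube, a difference of one position in a ring) can be sketched in Python as follows. This is a minimal illustration of those two generic rules only; it does not model the modified hypercube wiring itself, and the function names and the example ring size of 16 are assumptions made for the example.

```python
def hypercube_neighbors(address, n_bits):
    """Addresses that differ from `address` in exactly one bit position."""
    return [address ^ (1 << bit) for bit in range(n_bits)]


def ring_neighbors(position, ring_size):
    """The two ring positions one step below and one step above `position`."""
    return ((position - 1) % ring_size, (position + 1) % ring_size)


# A 3-bit hypercube address has three neighbors, one per bit position.
print(hypercube_neighbors(0b101, 3))

# A ring position always has exactly two neighbors; the ring wraps around.
print(ring_neighbors(0, 16))
```

In the hypercube the number of neighbors grows with the address width, while in a ring it stays at two regardless of ring size, which is the property the modified topology relies on.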

We prefer to use a modified hypercube. This is elaborated in the part of this application describing the 
technology. Finally, PMEs within a ring are paired such that one moves data externally clockwise along a 
ring of nodes and the other moves data externally counterclockwise along the ring of nodes, thus dedicating 
a PME to an external port. 

In our massively parallel machine, in our preferred embodiment, the interconnection and broadcast of 
data and instructions from one PME to another PME in the node and externally of the node to other nodes 
of a cluster or PMEs of a massively parallel processing environment are performed by a programmable 
router, allowing reconfiguration and virtual flexibility to the network operations. This important feature is fully 
distributed and embedded in the PME and allows for processor communication and data transfers among 
PMEs during operations of the system in SIMD and MIMD modes, as well as in the SIMD/MIMD and 
SIMIMD modes of operation. 

Within the rings each interconnection leg is a point-to-point connection. Each PME has a point-to-point 
connection with the two neighboring PMEs in its ring and with two neighboring PMEs in two adjacent rings. 
Three of these point-to-point connections are internal to the node, while the fourth point-to-point connection 
is to an adjacent node. 

The massively parallel processing system uses the processing elements, with their local memory and 
interconnect topology to connect all processors to each other. Embedded within the PME is our fully 
distributed I/O programmable router. Our system also provides an addition to the system which provides the 
ability to load and unload all the processing elements. With our zipper we provide a method for loading and 
unloading of the array of PEs and thus enable implementation of a fast I/O along an edge of the array's 
rings. To provide for external interface I/O any subset of the rings may be broken (un-zipped) across some 
dimension(s) with the resultant broken paths connected to the external interface. The co-pending 
application entitled "APAP I/O ZIPPER", filed concurrently herewith, USSN , filed May 1992, 
describes our "zipper" in additional detail. The "zipper" can be applied to only the subset of links required to 
support the peak external I/O load, which in all configurations considered so far leads to its being applied 
only to one or two edges of the physical design. 

The final type of I/O consists of data that must be broadcast to, or gathered from all PMEs, plus data 
which is too specialized to fit on the standard buses. Broadcast data includes commands, programs and 
data. Gathered data is primarily status and monitor functions while diagnostic and test functions are the 
specialized elements. Each node, in addition to the included set of PMEs, contains one Broadcast and 
Control Interface (BCI) section. 

Consider PMEs interconnected in a modified 4 dimensional hypercube network. If each ring contains 16 
PMEs, then the system will have 32,768 PMEs. The network diameter is 18 steps. Each PME contains in 
[...] routing 
provides the capability to reconfigure in the event of a faulty processing element or node. Inherent in a 4d, 
[...] byte wide half duplex rings is the provision for 410 gigabytes per second peak [...] 

The 4 dimensional hypercube leads to a particularly advantageous package. Eight of the PMEs 
(including data flow, memory and I/O paths and controls) are encompassed in a single chip. Thus, a node 
will be a single chip including pairs of elements along the rings. The nodes are configured together in an 8 
X 8 array to make up a cluster. The fully populated machine is built up of an array of 8 X 8 clusters to 
provide the maximum capacity of 32,768 PMEs. 
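
The packaging arithmetic above can be checked with a few lines of Python. The sketch assumes the reading of the text in which a node holds eight PMEs, a cluster is an 8 x 8 array of nodes, and the full machine is an 8 x 8 array of clusters; the constant names are illustrative, and the MIPS total is simply the product of the PME count with the figure of about 5 MIPS per PME quoted elsewhere in this description.

```python
PMES_PER_NODE = 8            # eight PMEs on one chip (one node)
NODES_PER_CLUSTER = 8 * 8    # nodes arranged as an 8 x 8 array
CLUSTERS = 8 * 8             # clusters arranged as an 8 x 8 array
MIPS_PER_PME = 5             # nominal figure, "about 5 MIPS" per PME

nodes = NODES_PER_CLUSTER * CLUSTERS
pmes = nodes * PMES_PER_NODE
nominal_mips = pmes * MIPS_PER_PME

print("nodes:", nodes)
print("PMEs:", pmes)
print("nominal MIPS:", nominal_mips)
```

Under these assumptions the node count equals 8 to the fourth power, matching eight nodes along each of the four dimensions, and the PME count agrees with the stated maximum capacity.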

Each PME is a powerful microcomputer having significant memory and I/O functions. There is multibyte 
data flow within a reduced instruction set (RISC) architecture. Each PME has 16 bit internal data flow and 
[...] use of working and general registers to manage data flow. There 
is a circuit switched and store and forward mode for I/O transfer under PME software control. The SIMD 
or MIMD mode is under PME software control. The PME can execute RISC instructions from either 
the BCI in SIMD mode, or from its own main memory in MIMD mode. Specific RISC instruction code 
points can be reinterpreted to perform unique functions in the SIMD mode. Each PME can implement an 
extended Instruction Set Architecture and provide routines which perform macro level instructions such as 
extended precision fixed point arithmetic, floating point arithmetic, vector arithmetic and the like. This 
permits not only complex math to be handled but image processing activities for display of image data in 
[...]. The system can select groups of 
PMEs for a function. PMEs assigned can allocate selected data and instructions for group processing. The 
operations can be externally monitored via the BCI. Each BCI has a primary control input, a secondary 
control input and a status monitor output for the node. Within a node the 2n PMEs can be connected in 
[...] communication network within the chip. Communication between PMEs is controlled by 
[...] under control of PME software. This permits the system to have a virtual 
[...]. Each PME can step messages up or down its own ring or to its neighboring PME in 
either of two adjacent rings. Each interface between PMEs is a point-to-point connection. The I/O ports 
permit off-chip extensions of the internal ring to adjacent nodes of the system. The system is built up of 
replications of a node to form a node array, a cluster, and other configurations. 
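
The statement above that a PME takes its instructions either from the BCI in SIMD mode or from its own main memory in MIMD mode can be pictured with a toy Python model. This is a hedged sketch and not the PME hardware: the class `ToyPME`, the `Mode` enumeration, the string "instructions" and the program counter are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SIMD = "simd"
    MIMD = "mimd"


@dataclass
class ToyPME:
    local_program: list      # instructions held in this PME's own memory
    mode: Mode = Mode.SIMD
    pc: int = 0              # program counter, advanced only in MIMD mode

    def next_instruction(self, broadcast):
        """Return the instruction this PME executes on the current step."""
        if self.mode is Mode.SIMD:
            return broadcast             # SIMD: take the broadcast instruction
        instruction = self.local_program[self.pc]
        self.pc += 1                     # MIMD: fetch from local memory
        return instruction


pmes = [ToyPME(["a0", "a1"]), ToyPME(["b0", "b1"])]
pmes[1].mode = Mode.MIMD                 # only a subset switches to MIMD mode

step1 = [p.next_instruction("bcast0") for p in pmes]
step2 = [p.next_instruction("bcast1") for p in pmes]
print(step1, step2)
```

Because the mode is a per-PME field changed by software, the sketch also shows how one subset of PMEs can follow the broadcast stream while another runs its own program.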

To complement our system's SIMD, MIMD, SIMD/MIMD and SIMIMD functionality, in our development we 
have provided unique operational modes. Among our SIMD/MIMD PME's unique modes are the new 
[...] features referred to as the "store and forward / circuit switch" functions. These hardware functions 
complemented with the on chip communication and programmable internal and external I/O routing provide 
the PME with very optimal data transferring capability. In the preferred mode of operation, processor 
memory is generally the data sink for messages and data targeted at the PME in the store and forward 
mode [...] PME [...] sent to the [...] output [...] in 
circuit switched mode. The PME software performs the selected routing path while giving the PME a 
dynamically selectable store and forward / circuit switch functionality. 

[...] we have provided is a fully distributed architecture for PMEs of a node. Each 
[...] switch logic, PME to PME 
communication, SIMD/MIMD switching capabilities, programmable routing, and dedicated floating point 
assist logic. The organization of every PME and its communication paths with other PMEs within the same 
chip serves to minimize chip crossing delays. PME functions can be independently operated by the PME and 
integrated with functions in the node, a cluster, and larger arrays. 
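
The store and forward / circuit switch selection mentioned above can be pictured with a small Python sketch. It relies on the usual meaning of the two terms (a stored message is sunk into local memory and forwarded later by software, a circuit switched message passes straight to an output port) and is an assumption-laden toy, not the PME logic; the class `ToyRouterPME` and all of its attributes are hypothetical.

```python
from enum import Enum


class Transfer(Enum):
    STORE_AND_FORWARD = 1
    CIRCUIT_SWITCHED = 2


class ToyRouterPME:
    def __init__(self):
        self.memory = []     # messages sunk into local memory
        self.output = []     # messages placed on the output port
        self.transfer = Transfer.STORE_AND_FORWARD

    def receive(self, message):
        """Handle an incoming message according to the selected transfer mode."""
        if self.transfer is Transfer.STORE_AND_FORWARD:
            self.memory.append(message)      # memory is the data sink
        else:
            self.output.append(message)      # passes through without storing

    def forward_stored(self):
        """Software-driven forwarding of everything buffered in memory."""
        self.output.extend(self.memory)
        self.memory.clear()


pme = ToyRouterPME()
pme.receive("m1")                            # stored in memory
pme.transfer = Transfer.CIRCUIT_SWITCHED     # mode selected dynamically by software
pme.receive("m2")                            # goes straight to the output port
after_switch = (list(pme.memory), list(pme.output))
pme.forward_stored()
print(after_switch, pme.output)
```

The point of the sketch is only that the choice between the two behaviors is a software-selectable state of each PME rather than a fixed property of the network.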

Our massively parallel system is made up of nodal building blocks of multiprocessor nodes, clusters of 
nodes, and arrays of PMEs already packaged in clusters. For control of these packaged systems we 
provide a system array director which with the hardware controllers performs the overall Processing 
Memory Element (PME) Array Controller functions in the massively parallel processing environment. The 
Director comprises three functional areas, the Application Interface, the Cluster Synchronizer, and 
normally a Cluster Controller. The Array Director will have the overall control of the PME array, using the 
broadcast bus and our zipper connection to steer data and commands to all of the PMEs. The Array 
Director functions as a software system interacting with the hardware to perform the role as the shell of the 
APAP operating system. 

The interconnection for our PMEs for a massively parallel array computer SIMD/MIMD processing 
memory element (PME) interconnection provides the processor to processor connection in the massively 
parallel processing environment. Each PME utilizes our fully distributed interprocessor communication 
hardware from the on-chip PME to PME connection, to the off-chip I/O facilities which support the 
chip-to-chip interconnection. Our modified topology limits our cluster to cluster wiring while supporting the 
advantages of hypercube connections. 

The concepts which we employ for a PME node are related to the VLSI packaging techniques used for 
the Advanced Parallel Array Processor (APAP) computer system disclosed here, which packaging features 
of our invention provide enhancements to the manufacturing ability of the APAP system. These techniques 
are unique in the area of massively parallel processor machines and will enable the machine to be 
packaged and configured in optimal subsets that can be built and tested. 

The packaging techniques take advantage of the eight PMEs packaged in a single chip and arranged in 
an N-dimensional modified hypercube configuration. This chip level package or node of the array is the 
smallest building block in the APAP design. These nodes are then packaged in an 8 x 8 array where the 
+-X and the +-Y make rings within the array or cluster and the +-W and +-Z are brought out to the 
neighboring clusters. A grouping of clusters makes up an array. The intended applications for APAP 
computers depend upon the particular configuration and host. Large systems attached to mainframes with 
effective vectorized floating point processors might address special vectorizable problems - such as weather 
prediction, wind tunnel simulation, turbulent fluid modeling and finite element modeling. Where these 
problems involve sparse matrices, significant work must be done to prepare the data for vectorized 
arithmetic and likewise to store results. That workload would be off-loaded to the APAP. In intermediate size 
systems, the APAP might be dedicated to performing the graphics operations associated with visualization, 
or with some preprocessing operation on incoming data (e.g., performing optimum assignment problems in 
military sensor fusion applications). Small systems attached to workstations or PCs might serve as 
programmer development stations or might emulate a vectorized floating point processor attachment or a 
3d graphics processor. 



BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows a parallel processor processing element like those which would utilize old technology. 

FIG. 2 shows a massively parallel processor building block in accordance with our invention, 
representing our new chip design. 

FIG. 3 illustrates on its right side the preferred chip physical cluster layout for our preferred 
embodiment of a chip single node fine grained parallel processor. There each chip is a 
scalable parallel processor chip providing 5 MIPS performance with CMOS DRAM memory 
and logic permitting air cooled implementation of massive concurrent systems. On the left 
side of FIGURE 3, there is illustrated the replaced technology. 

FIG. 4 shows a computer processor functional block diagram in accordance with the invention. 

FIG. 5 shows a typical Advanced Parallel Array Processor computer system configuration. 

FIG. 6 shows a system overview of our fine-grained parallel processor technology in accordance 
with our invention, illustrating system build up using replication of the PME element which 
permits systems to be developed with 40 to 163,840 MIPS performance. 

FIG. 7 illustrates the hardware for the processing element (PME) data flow and local memory in 
accordance with our invention, while 

FIG. 8 illustrates PME data flow where a processor memory element is configured as a hardwired 
general purpose computer that provides about 5 MIPS fixed point processing or .4 MFLOPS via 
programmed control floating point operations. 

FIG. 9 shows the PME to PME connection (binary hypercube) and data paths that can be taken in 
accordance with our invention, while 

FIG. 10 illustrates node interconnections for the chip or node which has 8 PMEs, each of which 
manages a single external port and permits distribution of the network control function and 
eliminates a functional hardware port bottleneck. 

FIG. 11 is a block diagram of a scalable parallel processor chip where each PME is a 16 bit wide 
processor with 32 K words of local memory and there is I/O porting for a broadcast port 
which provides a controller-to-all interface while external ports are bidirectional point-to-point 
interfaces permitting ring torus connections within the chip and externally. 

FIG. 12 shows an array director in the preferred embodiment. 

FIG. 13 in part (a) illustrates the system bus to or from a cluster array coupling enabling loading or 
unloading of the array by connecting the edges of clusters to the system bus (see FIGURE 
14). In part (b) of FIGURE 13 there is the bus to/from the processing element portion. 
FIGURE 13 illustrates how multiple system buses can be supported with multiple clusters. 
Each cluster can support 50 to 57 Mbyte/s bandwidth. 

FIG. 14 shows a "zipper" connection for fast I/O connection. 

FIG. 15 shows an 8 degree hypercube connection illustrating a packaging technique in accordance 
with our invention applicable to an 8 degree hypercube. 

FIG. 16 shows two independent node connections in the hypercube. 

FIG. 17 shows the [...] Sort algorithm as an example to illustrate the advantages of the defined 
SIMD/MIMD processor system. 

FIG. 18 illustrates a system block diagram for a host attached large system with one application 
processor interface illustrated. This illustration may also be viewed with the understanding 
that our invention may be employed in stand alone systems which use multiple application 
[...]. Such interfaces in a FIGURE 18 configuration will support 
DASD/Graphics on all or many clusters. Workstation accelerators can eliminate the host, 
application processor interface (API) and cluster synchronizer (CS) illustrated by emulation. 
The CS [...] 

FIG. 19 illustrates the software development environment for our system. Programs can be prepared 
by and executed from the host application processor. Both program and machine debug is 
supported by the workstation based console illustrated here and in FIGURE 22. Both of these 
services will support applications operating on a real or a simulated MMP, enabling 
applications to be developed at a workstation level as well as on a supercomputer formed of the 
APAP MMP. The common software environment enhances programmability and distributed 
usage. 

FIG. 20 illustrates the programming levels which are permitted by the new systems. As different 
users require more or less detailed knowledge, the software system is developed to support 
this variation. At the highest level the user does not need to know the architecture is indeed 
an MMP. The system can be used with existing language systems for partitioning of 
programs, such as parallel Fortran. 

FIG. 21 illustrates the parallel Fortran compiler system for the MMP provided by the APAP 
configurations described. A sequential to parallel compiler system uses a combination of existing 
compiler capability with new data allocation functions and enables use of a partitioning 
program [...] 

FIG. 22 illustrates the workstation application of the APAP, where the APAP becomes a workstation 
accelerator. Note that the unit has the same physical size as a RISC/6000 Model 530, but 
this model now contains an MMP which is attached to the workstation via a bus extension 
module illustrated. 

FIG. 23 illustrates an application for an APAP MMP module for an AWACS military or commercial 
application. This is a way of handling efficiently the classical distributed sensor fusion 
problem shown in FIGURE 23, where the observation to track matching is classically done 
with well known algorithms like nearest neighbor, 2 dimensional linear assignment (Munkres), 
probabilistic data association or multiple hypothesis testing, but these can now be done in an 
improved manner as illustrated by FIGURES 24 and 25. 

FIG. 24 illustrates how the system provides the ability to handle n-dimensional assignment problems 
in real time. 

FIG. 25 illustrates processing flow for an n-dimensional assignment problem utilizing an APAP. 

FIG. 26 illustrates the expansion unit provided by the system enclosure described, showing how a 
unit can provide 424 MFLOPS or 5120 MIPS using only 8 to 13 extended SEM-E modules, 
providing performance comparable to that of a specialized signal processor module in only 
.6 cubic foot. This system can become a SIMD massive machine with 1024 parallel 
processors performing two billion operations per second (GOPS) and can grow by adding 
1024 additional processors and additional storage. 

FIG. 27 illustrates the APAP packaging for a supercomputer. Here is a large system of comparable 
performance but much smaller footprint than other systems. It can be built by replicating the 
APAP cluster within an enclosure like those used for smaller machines. 

We have provided, as part of the description, Tables illustrating the hardwired instructions for a PME, in 
which Table 1 illustrates fixed-point arithmetic instructions; Table 2 illustrates storage to storage 
instructions; Table 3 illustrates logical instructions; Table 4 illustrates shift instructions; Table 5 illustrates branch 
instructions; Table 6 illustrates the status switching instructions; and Table 7 illustrates the input/output 
instructions. 

(Note: For convenience of illustration in the formal patent drawings, FIGURES may be separated in parts 
and as a convention we place the top of the FIGURE as the first sheet with subsequent sheets proceeding 
down and across when viewing the FIGURE, in the event that multiple sheets are used.) 

Our detailed description follows with parts explaining the preferred embodiments of our invention 
provided by way of example. 

DETAILED DESCRIPTION OF THE INVENTION 

Turning now to our invention in greater detail, it will be seen from FIGURE 1, which illustrates the 
existing technology level illustrated by the transputer T800 chip, and representing similar chips for such 
machines as illustrated by the Touchstone Delta (i860), N Cube ('386), and others. When FIGURE 1 is 
compared with the developments here, it will be seen that not only can systems like the prior systems be 
substantially improved by employing our invention, but also new powerful systems can be created, as we 
will describe. FIGURE 1's conventional modern microprocessor technology consumes pins and memory. 
Bandwidth is limited and inter-chip communication drags the system down. 

The new technology leapfrog represented by FIGURE 2 merges processors, memory, I/O into multiple 
PMEs (eight or more 16 bit processors each of which has no memory access delays and uses all the pins 
for networking) formed on a single low power CMOS DRAM chip. The system can make use of ideas of our 
prior referenced disclosures as well as inventions separately described in the applications filed concurrently 
herewith and applicable to the system we describe here. Thus, for this purpose they are incorporated herein 
by reference. Our concepts of grouping, autonomy, transparency, zipper interaction, asynchronous SIMD, 
SIMIMD or SIMD/MIMD, can all be employed with the new technology, even though to lesser advantage 
they can be employed in the systems of the prior technology and in combination with our own prior multiple 
picket processor. 

Our picket system can employ the present processor. Our basic concept is that we have now provided 
a replicable brick, a new basic building block for systems with our new memory processor, a memory unit 
having embedded processors, router and I/O. This basic building block is scalable. The basic system which 
we have implemented employs a 4 Meg. CMOS DRAM. It is expandable to be used in larger memory 
configurations, with 16Mbit DRAMs, and 64Mbit chips by expansion. Each processor is a gate array. With 
denser deposition, many more processors, at higher clock speeds, can be placed on the same chip; and 
using gates and additional memory will expand the performance of each PME. Scaling a single part type 
provides a system framework and architecture which can have a performance well into the PETAOP range. 

FIGURE 2 illustrates the memory processor which we call the PME or processor memory element in 
accordance with our preferred embodiment. The processor has eight or more processors. In the pictured 
embodiment there are eight. The chip can be expanded (horizontally) to add more processors. The chip 
can, as preferred, retain the logic and expand the DRAM memory with additional cells linearly (vertically). 
Pictured are 16 - 32K by 9 bit sections of DRAM memory surrounding a field of CMOS gate array gates 
which implement 8 replications of a 16 bit wide data flow processor. 
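
The memory figures in this paragraph can be cross-checked with simple arithmetic, sketched here in Python. The sketch assumes that eight of the nine bits in each section width carry data (the text here does not say what the ninth bit is used for) and that the sixteen sections are shared evenly among the eight processors; the constant names are illustrative.

```python
SECTIONS = 16                    # DRAM sections pictured on the chip
WORDS_PER_SECTION = 32 * 1024    # each section is 32K deep
BITS_PER_WORD = 9                # each section is 9 bits wide
DATA_BITS_PER_WORD = 8           # assumption: 8 data bits plus 1 extra bit
PROCESSORS = 8                   # processors in the pictured embodiment

total_bits = SECTIONS * WORDS_PER_SECTION * BITS_PER_WORD
data_bits = SECTIONS * WORDS_PER_SECTION * DATA_BITS_PER_WORD
sections_per_processor = SECTIONS // PROCESSORS
data_width_per_processor = sections_per_processor * DATA_BITS_PER_WORD

print("total bits:", total_bits)
print("data megabits:", data_bits / 2**20)
print("data width per processor:", data_width_per_processor)
```

Under these assumptions the data capacity comes to 4 megabits, in line with the 4 Meg. DRAM mentioned above, and two sections per processor give the 16 bit data width.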

The chip is made using IBM low power sub-micron CMOS deposition on silicon technology. It uses selected
silicon with trench to provide significant storage on a small chip surface. Our memory and multiple
processors organized interconnect is made with IBM's advanced art of making semiconductor chips.
However, it will be recognized that the little chip we describe has about 4 Meg. memory. It is designed so
that as 16 Meg. memory technology becomes stable, when improved yields and methods of accommodating
defects are certain, our little chip can migrate to larger memory sizes, each 9 bits wide, without changing
the logic. Advances in photo and X-ray lithography keep pushing minimum feature size to well below 0.5
microns. Our design envisions more progress. These advances will permit placement of very large amounts
of memory with processing on a single silicon chip.

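Our device is a 4 MEG CMOS DRAM believed to be the first general memory chip with extensive room
for logic. 16 replications of a 32K by 9 bit DRAM macro make up the memory array. The DRAM has 120K
cells which it allocates for application logic on the chip, with triple level metal wiring. The 35 ns or less
DRAM access time matches the processor cycle time. This CMOS implementation provides logic density
for a very effective PE (picket) and does so while dissipating 1.3 watts for the logic. The separate memory
sections of the chip, each 32K by 9 bits, surround the field of CMOS gate array gates representing 120K
cells, and having the logic described in other figures. Memory is barriered and with a separated power
source dissipates 0.9 watts. In combining significant amounts of logic on the same silicon substrate with
significant amounts of memory, the problems involved with the electrical noise incompatibility of logic and
DRAM have been overcome. Logic tends to be very noisy, while memory needs relative quiet to sense the
millivolt size signals that result from reading the cells of DRAM. We prefer to provide trenched triple
metal layer silicon deposition with separate barriered portions of the chip devoted to memory and to
processor logic, with voltage and ground isolation and separate power distribution and barriers, to
achieve compatibility between logic and DRAM.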

APAP System Overview of Preferred Embodiments

This description introduces the new technology in the following order:
1. Technology
2. Chip H/W description
3. Networking and system build up
4. Software
5. Applications

The initial sections of the detailed description describe how 4 Meg DRAM low power CMOS chips are made
into manufactured PME DRAM chips, each with eight sets of:
1. a 16 bit, 5 mips data flow,
2. independent instruction stream and interrupt processing, and
3. an 8 bit (plus parity and controls) wide external port and interconnection to 3 other on chip processors.

The technology provides multiple functions which are integrated into a single chip design. The chip
which results is so powerful and so flexible that a chip having eight or more such sets
will be effective at processing, routing, storage and three classes of I/O. This chip has integrated memory
and control logic within the single chip to make the PME, and this combination is replicated within the chip.
A processor system is built from replications of the single chip.

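The approach partitions the low power CMOS DRAM. It may be formed as multiple word length (16) bit
by 32K sections, associating one section with a processor. (We use the term PME to refer to a
single processor, memory and I/O capable system unit.) This partitioning leads to each DRAM chip being
an 8 way cube connected MIMD parallel processor with independent interconnection ports. (See
FIGURE 6 for an illustration of fine-grained parallel technology, illustrating replication and the
ring torus possibilities.)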

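The software description addresses several distinct program types. At the lowest level, processes
interface the user's program (or services called by the application) to the detailed hardware (H/W) needs.
This level includes the tasks required to manage the I/O and interprocessor synchronization and is what
might be called a microprogram for the MPP. An intermediate level of services provides for both mapping
applications (developed with vector or matrix operators) to the MPP, and also control, synchronization,
startup, and diagnostic functions. At the host level, high order languages are supported by library
functions that support vectorized programs with either simple automatic data allocation to the MPP or user
tuned data allocation. The multilevel software approach permits applications to exploit different degrees
of control and optimization within a single program. Thus, a user can code application programs without
understanding the architectural detail, while an optimizer might tune at the microcode level only the small
high usage kernels of a program.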

Sections of our description that describe 1024 element 5 GIPS units and a 32,768 element 164 GIPS
unit illustrate the range of possible systems. However, those are not the limits; both smaller and larger
units are feasible. These particular sizes have been selected as examples because the small unit is suitable
to microprocessors (accelerators), personal computers, workstations and military applications (using of
course different packaging techniques), while the larger unit is illustrative of a mainframe application as a
module or complete supercomputer system. A software description will provide examples of other
challenging work that might be effectively programmed on each of the illustrative systems.

PME DRAM CMOS - A BASE FOR A MULTIPROCESSOR PME

FIGURE 2 illustrates our technology improvement at the chip technology level. This extendable
computer organization is very cost and performance efficient over the wide range of system sizes because
it uses only one chip type. Combining the memory and processing on one chip eliminates the pins
dedicated to the memory bus and their associated reliability and performance penalties. Replication of our
design within the chip makes it economically feasible to consider custom logic designs for processor
subsections. Replication of the chip within the system leads to large scale manufacturing economies.
Finally, CMOS technology requires low power per MIP, which in turn minimizes power supply and cooling
needs. The chip architecture can be programmed for multiple word lengths, enabling operations to be
performed that would otherwise require much larger length processors. In combination these attributes
permit the extensive range of system performance.

Our new technology can be compared with a possible extension of the old technology. It is
apparent that the advantages of smaller features have been used by processor designers to construct more
complex chips and by memory designers to provide greater replication of the simple element. If the trend
continues one could expect memories to get four times as large while processors might exploit density to:
1. include multiple execute units with instruction routers,
2. increase cache sizes and associative capability, and/or
3. increase instruction look ahead and advance computation capability.

However, these approaches to the old technology illustrated by FIGURE 1 all tend to dead end.
Duplicating processors leads to linearly increasing pin requirements, but pins per chip is fixed. Better
caching can only exploit the application's data reuse pattern. Beyond that, memory bandwidth becomes the
limit. Application data dependencies and branching limit the potential advantage of look ahead schemes.
Additionally, it is not apparent that MPP applications with fine-grained parallelism need 1, 4 or 16
megaword memories per processing unit. Attempting to share such large memories between multiple
processors results in severe memory bandwidth limitations.

Our new approach is not dead ended. We combine significant memory, I/O and processor into
a single chip, as illustrated by FIGURE 2 and subsequent illustration and description. It reduces part
number requirements and eliminates the delays associated with chip crossing. More importantly, this
permits all the chip's I/O pins to be dedicated to interprocessor communication and thus maximizes
network bandwidth.

To implement our preferred embodiment illustrated in FIGURE 2 we use a process that is available now,
using IBM low power CMOS technology. Our illustrated embodiment can be made with CMOS DRAM
density, in CMOS, and can be implemented in denser CMOS. Our illustrated embodiment of 32K memory
cells for each of 8 PMEs on a chip can be increased as CMOS becomes denser. In our embodiment we
utilize the real estate and process technology for a 4 MEG CMOS DRAM, and expand this with processor
replication associated with 32K memory on the chip itself. The chip, it will be seen, has processor, memory,
and I/O in each of the chip packages of the cluster shown in FIGURE 3. Within each package is a memory
with embedded processor element, router, and I/O, all contained in a 4 MEG CMOS DRAM believed to be
the first general memory chip with extensive room for logic. It uses selected silicon with trench to provide
significant storage on a small chip surface. Each processor chip of our design alternatively can be made
with 16 replications of a 32K by 9 bit DRAM macro (35/80 ns) using 0.87 micron CMOS logic to make up the
memory array. The device is unique in that it allocates surface area for 120K cells of application logic on
the chip, supported by the capability of triple level metal wiring. The multiple cards of the old technology are
shown crossed out on the left side of FIGURE 3.

Our basic replicable element brick technology is an answer to the old technology. If one considered the
"old" technology on the left of FIGURE 3, one would see too many chips, too many cards, and waste. For
example, today's proposed teraflop machines that others offer would have literally a million or more chips in
them. With today's other technology only a few percent of these chips, at best, are truly operations
producers. The rest are "overhead" (typically memory, network interface, etc.).

It will become evident that it is not feasible to package such chips, in such a large number, in anything
that must operate in a constrained environment of physical size. (How many could you fit in a small area of
a cockpit?) Furthermore, such proposed teraflop machines of others, already huge, must scale up 1000
times to reach the petaop range. We have a solution which dramatically decreases the percent of
non-operations producing chips. We provide increased bandwidth. We provide this within a reasonable network
dimensionality. With such a brick technology, where memory becomes the operator, and networks are used
for passing controls, the operations producing chips are dramatically increased. In addition, the upgrade
dramatically reduces the number of different types of chips. Our system is designed for scale-up, without a
requirement for specialized packaging, cooling, power, or environmental constraints.

With our brick technology, using memory units with built in processors and network capability instead
of separate processors, we obtain the configuration shown in FIGURE 3, representing a card with chips
which are pin compatible with current 4 Mbit DRAM cards at the connector level. Such a single card could
hold, with a design point of a basic 40 mips per chip performance level, 32 chips, or 1280 mips. Four such
cards would provide 5 gips. The workstation configuration which is illustrated would preferably have such a
PE memory array, a cluster controller, and an IBM RISC System/6000 which has sufficient performance to
run and monitor execution of an array processor application developed at the workstation.
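
The performance figures quoted in this description follow from simple multiplication, and the short Python sketch that follows reproduces them. It is an illustration only: the constant and function names are our own assumptions, and the 5 mips per PME rate is taken as the 40 mips per chip design point divided among 8 PMEs.

```python
# Illustrative arithmetic only; all names here are hypothetical.
MIPS_PER_PME = 5        # 40 mips per chip design point / 8 PMEs per chip
PMES_PER_CHIP = 8
CHIPS_PER_CARD = 32


def chip_mips() -> int:
    """Aggregate rate of one chip (one node of 8 PMEs)."""
    return MIPS_PER_PME * PMES_PER_CHIP


def card_mips(chips: int = CHIPS_PER_CARD) -> int:
    """Aggregate rate of one card carrying `chips` chips."""
    return chip_mips() * chips


def system_gips(pme_count: int) -> float:
    """Aggregate rate, in gips, of a system with `pme_count` PMEs."""
    return pme_count * MIPS_PER_PME / 1000.0


if __name__ == "__main__":
    print("chip  :", chip_mips(), "mips")
    print("card  :", card_mips(), "mips")
    print("4 card:", 4 * card_mips() / 1000.0, "gips")
    print("32,768 PMEs:", system_gips(32768), "gips")
```

Four cards come to 5.12 gips, which the description rounds to 5 gips.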

A very gate efficient processor can be used in the processor portion. Such designs for processors have
been employed, but never within memory. Indeed, in addition, we have provided the ability to mix MIMD
and SIMD basic operation provisions. Our chip provides a "broadcast bus" which provides an alternate path
into each CPU's instruction buffer. Our cluster controller issues commands to the PMEs,
and these can be stored in the PME to control their operation in one mode or another. Each PME does not
have to store an entire program, but can store only those portions applicable to a given task at various
times during processing of an application.

Given the basic device one can elect to develop a single processor memory combination. Alternatively,
by using a more simple processor and a subset of the memory macros, one can design for 2, 4, 8 or 16
replications of the basic processing element (PME). The PME can be made simpler either by adjusting
the dataflow bandwidth or by substituting processor cycles for functional accelerators. For most
embodiments we prefer to make 8 replications of the basic processing element we describe.
Our application studies have indicated that for now the most favorable answer is 8 replications of a 16
bit wide data flow and 32K word memory. We conclude this because:
1. 16 bit words permit single cycle fetch of instructions and addresses.
2. 8 PMEs each with an external port permits 4 dimensional torus interconnections. Using 4 or 8 PMEs
on each ring leads to modules suitable for the range of targeted system performances.
3. 8 external ports requires about 50% of the chip pins, providing sufficient remainder for power, ground
and common control signals.
4. 8 processors implemented in a 64 KByte Main Store
a. allows for a register based architecture rather than a memory mapped architecture, and
b. forces some desirable but not required accelerators to be implemented by multiple processor
cycles.
This last attribute is important because it permits use of the developing logic density. Our new
accelerators (e.g. a floating point arithmetic unit per PME) are added as chip hardware without affecting
system design, pins and cables, or application code.
The resultant chip layout and size (14.59 x 14.63 mm) is shown in FIGURE 2, and FIGURE 3 shows a
cluster of such chips, which can be packaged in systems like those shown in later figures for stand alone
units, workstations which slide next to a workstation host with a connection bus, in AWACs applications, and
in supercomputers. This chip technology provides a number of system level advantages. It permits
development of the scalable MPP by basic replication of a single part type. The two DRAM macros per
PME provide sufficient storage for both data and program. An SRAM of equivalent size might consume
more than 10 times more power. This advantage permits MIMD machine models rather than the more
limited SIMD models characteristic of machines with single chip processor/memory designs. The 35 ns or
less DRAM access time matches the expected processor cycle time. CMOS logic provides the logic density
for a very effective PME and does so while dissipating only 1.3 watts. (Total chip power is 1.3 + 0.9
(memory) = 2.2 watts.) Those features in turn permit using the chip in MIL applications requiring conduction
cooling. (Air cooling in non-MIL applications is significantly easier.) However, the air cooled embodiment can
be used for workstation and other environments. A standalone processor might be configured with an 80
amp, 5 volt power supply.

Advanced Parallel Array Processor (APAP) building blocks are shown in FIGURE 4 and in FIGURE 5.
FIGURE 4 illustrates the functional block diagram of the Advanced Parallel Array Processor. Multiple
application interfaces 150, 160, 170, 180 exist for the application processor 100 or processors 110, 120,
130. FIGURE 5 illustrates the basic building blocks that can be configured into different system block
diagrams. The APAP, in a maximum configuration, can incorporate 32,768 identical PMEs. The processor
consists of the PME Array 280, 290, 300, 310, an Array Director 250 and an Application Processor Interface
260 for the application processor 200 or processors 210, 220, 230. The Array Director 250 consists of three
functional units: Application Processor Interface 260, cluster Synchronizer 270 and cluster Controller 270.
An Array Director can perform the functions of the array controller of our prior linear picket system for SIMD
operations with MIMD capability. The cluster controller 270, along with a set of 64 Array clusters 280, 290,
300, 310 (i.e. clusters of 512 PMEs), is the basic building block of the APAP computer system. The
elements of the Array Director 250 permit configuring systems with a wide range of cluster replications. This
modularity based upon strict replication of both processing and control elements is unique to this massively
parallel computer system. In addition, the Application Processor Interface 260 supports the Test/Debug
device 240 which will accomplish important design, debug, and monitoring functions.

Controllers are assembled with a well-defined interface, e.g. IBM's Microchannel, used in other systems
today, including controllers with i860 processors. Field programmable gate arrays add functions to the
controller which can be changed to meet a particular configuration's requirements (how many PMEs there
are, their couplings, etc.).

The PME arrays 280, 290, 300, 310 contain the functions needed to operate as either SIMD or MIMD
devices. They also contain functions that permit the complete set of PMEs to be divided into 1 to 256
distinct subsets. When divided into subsets the Array Director 250 interleaves between subsets. The
sequence of the interleave process and the amount of control exercised over each subset is program
controlled. This capability to operate distinct subsets of the array in one mode, i.e., MIMD with differing
programs, while other sets operate in tightly synchronized SIMD mode under Array Director control,
represents an advance in the art. Several examples presented later illustrate the advantages of the concept.

Array Architecture 

The set of nodes forming the Array is connected as an n-dimensional modified hypercube. In that
interconnection scheme, each node has direct connections to 2n other nodes. Those connections can be
either simplex, half-duplex or full-duplex type paths. In any dimension greater than 3d, the modified
hypercube is a new concept in interconnection techniques. (The modified hypercube in the 2d case
generates a torus, and in the 3d case an orthogonally connected lattice with edge surfaces wrapped to
opposing surfaces.)

To describe the interconnection scheme for greater than 3d cases requires an inductive description. A
set of m_1 nodes can be interconnected as a ring. (The ring could be 'simply connected', 'braided', 'cross
connected', 'fully connected', etc. Although additional node ports are needed for greater than simple rings,
that added complexity does not affect the modified hypercube structure.) The m_2 rings can then be linked
together by connecting each equivalent node in the m_2 set of rings. The result at this point is a torus. To
construct an (i+1)d modified hypercube from an id modified hypercube, consider m_(i+1) sets of id modified
hypercubes and interconnect all of the equivalent m_i level nodes into rings.

This process is illustrated for the 4d modified hypercube, using m_i = 8 for i = 1..4, by the illustration in
FIGURE 6. Compare our description under node Topology and also FIGURES 6, 9, 10, 15 and 16.
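
The inductive construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the described hardware: nodes are modelled as coordinate tuples, every ring is taken to be 'simply connected', and the names `neighbors` and `build_modified_hypercube` are hypothetical.

```python
from itertools import product


def neighbors(node, sizes):
    """Return the 2n ring neighbours of `node`.

    `sizes` holds the ring length m_i of each of the n dimensions; moving
    one step either way around the ring in dimension i (with wrap-around)
    gives the two neighbours contributed by that dimension.
    """
    result = []
    for dim, ring_length in enumerate(sizes):
        for step in (+1, -1):
            moved = list(node)
            moved[dim] = (node[dim] + step) % ring_length
            result.append(tuple(moved))
    return result


def build_modified_hypercube(sizes):
    """Map every node coordinate to the list of its ring neighbours."""
    coords = product(*(range(m) for m in sizes))
    return {node: neighbors(node, sizes) for node in coords}


if __name__ == "__main__":
    array = build_modified_hypercube((8, 8, 8, 8))   # 4d case, m_i = 8
    print(len(array), "nodes")
    print(len(array[(0, 0, 0, 0)]), "neighbours per node")
    print(len(array) * 8, "PMEs at 8 PMEs per node")
```

With m_i = 8 in four dimensions the sketch yields 4096 nodes, each with 2n = 8 neighbours, and 8 PMEs per node gives the 32,768 PME maximum configuration.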

FIGURE 6 illustrates the fine-grained parallel technology path from the single processor element 300,
made up of 32K 16-bit words with a 16-bit processor, to the Network node 310 of eight processors 312 and
their associated memory 311 with their fully distributed I/O routers 313 and Signal I/O ports 314, 315, on
through groups of nodes labeled clusters 320 and into the cluster configuration 360 and to the various
applications 330, 340, 350, 370. The 2d level structure is the cluster 320, and 64 clusters are integrated to
form the 4d modified hypercube of 32,768 Processing Elements 360.

Processing Array Element (PME) Preferred Embodiment

As illustrated by FIGURE 2 and FIGURE 11 the preferred APAP has a basic building block of a one
chip node. Each node contains 8 identical processor memory elements (PMEs) and one broadcast and
control interface (BCI). While some of our inventions may be implemented when all functions are not on the
same chip, it is important from a performance and cost reduction standpoint to provide the chip as a one
chip node with the 8 processor memory elements using the advanced technology which we have described
and can be implemented today.

The preferred implementation of a PME has a 64 KByte main store, 16 16-bit general registers on each
of 8 program interrupt levels, a full function arithmetic/logic unit (ALU) with working registers, a status
register, and four programmable bidirectional I/O ports. In addition the preferred implementation provides a
SIMD mode broadcast interface via the broadcast and control interface (BCI) which allows an external
controller (see our original parent application and the description of our currently preferred embodiment for
a nodal array and system with clusters) to drive PME operation decode, memory address, and ALU data
inputs. This chip can perform the functions of a microcomputer allowing multiple parallel operations to be
performed within it, and it can be coupled to other chips within a system of multiple nodes, whether by an
interconnection network, a mesh or hypercube network, or our preferred and advanced scalable
embodiment.

The PMEs are interconnected in a series of rings or tori in our preferred scalable embodiment. In some
applications the nodes could be interconnected in a mesh. In our preferred embodiment each node contains
two PMEs in each of four tori. The tori are denoted W, X, Y, and Z (see FIGURE 6). FIGURE 11 depicts the
interconnection of the PMEs within a node. The two PMEs in each torus are designated by their external I/O
port (+W, -W, +X, -X, +Y, -Y, +Z, -Z). Within the node, there are also two rings which interconnect the 4
+n and 4 -n PMEs. These internal rings provide the path for messages to move between the external tori.

Since the APAP can be in our preferred embodiment a four dimensional orthogonal array, the internal
rings allow messages to move throughout the array in all dimensions.

The PMEs are self-contained stored program microcomputers comprising a main store, local store,
operation decode, arithmetic/logic unit (ALU), working registers and Input/Output (I/O) ports. The PMEs have
the capability of fetching and executing stored instructions from their own main store in MIMD operation or
to fetch and execute commands via the BCI interface in SIMD mode. This interface permits
intercommunication among the controller, the PME, and other PMEs in a system made up of multiple chips.

The BCI is the node's interface to the external array controller element and to an array director. The
BCI provides common node functions such as timers and clocks. The BCI provides broadcast function
masking for each nodal PME and provides the physical interface and buffering for the broadcast-bus-to-
PME data transfers, and also provides the nodal interface to system status and monitoring and debug
elements.



Each PME contains separate interrupt levels to support each of its point-to-point interfaces and the
broadcast interface. Data is input to the PME main store or output from PME main store under Direct
Memory Access (DMA) control. An "input transfer complete" interrupt is available for each of the interfaces
to signal the PME software that data is present. Status information is available for the software to determine
the completion of data output operations.

Each PME has a "circuit switched mode" of I/O in which one of its four input ports can be switched
directly to one of its four output ports, without having the data enter the PME main store. Selection of the
source and destination of the "circuit switch" is under control of the software executing on the PME. The
other three input ports continue to have access to PME main store functions, while the fourth input is
switched to an output port.
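
The behaviour of the circuit switched mode can be sketched as a toy software model. This is only an illustration under our own assumptions: the class `PmePorts`, its method names and the 0..3 port numbering are hypothetical, and real transfers would occur under DMA control in hardware rather than through Python lists.

```python
class PmePorts:
    """Toy model of one PME's four input ports and four output ports."""

    PORT_COUNT = 4

    def __init__(self):
        self.main_store = []                       # (input_port, word) pairs
        self.outputs = {p: [] for p in range(self.PORT_COUNT)}
        self.switch = None                         # (input_port, output_port)

    def set_circuit_switch(self, input_port, output_port):
        """Software on the PME selects the switched source and destination."""
        for port in (input_port, output_port):
            if not 0 <= port < self.PORT_COUNT:
                raise ValueError("port numbers must be 0..3")
        self.switch = (input_port, output_port)

    def clear_circuit_switch(self):
        self.switch = None

    def receive(self, input_port, word):
        """Deliver one word arriving on `input_port`."""
        if self.switch is not None and self.switch[0] == input_port:
            # Switched input: the word bypasses main store entirely.
            self.outputs[self.switch[1]].append(word)
        else:
            # The other inputs keep their normal access to main store.
            self.main_store.append((input_port, word))


if __name__ == "__main__":
    pme = PmePorts()
    pme.set_circuit_switch(input_port=2, output_port=0)
    pme.receive(2, 0xABCD)      # passes straight through to output 0
    pme.receive(1, 0x1234)      # stored in main store
    print(pme.outputs[0], pme.main_store)
```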

An additional type of I/O has data that must be broadcast to, or gathered from, all PMEs, plus data
which is too specialized to fit on the standard buses. Broadcast data can include SIMD commands, MIMD
programs, and SIMD data. Gathered data is primarily status and monitor functions. Diagnostic and test
functions are the specialized data elements. Each node, in addition to the included set of PMEs, contains
one BCI. During operations the BCI section monitors the broadcast interface and steers/collects broadcast
data to/from the addressed PME(s). A combination of enabling masks and addressing tags is used by the
BCI to determine what broadcast information is intended for which PMEs.
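
One way to picture the combination of enabling masks and addressing tags is the following Python sketch. It is a hypothetical model, not the BCI logic itself: the class `Bci`, the bit-per-PME mask encoding and the integer tag per PME are all assumptions made for illustration.

```python
class Bci:
    """Toy broadcast and control interface for one node of 8 PMEs."""

    def __init__(self, pme_count=8):
        self.pme_count = pme_count
        self.enable_mask = (1 << pme_count) - 1   # bit i set: PME i enabled
        self.tags = [0] * pme_count               # assumed per-PME group tag

    def targets(self, tag=None):
        """Indices of PMEs that a broadcast carrying `tag` is intended for."""
        selected = []
        for index in range(self.pme_count):
            enabled = (self.enable_mask >> index) & 1
            if enabled and (tag is None or self.tags[index] == tag):
                selected.append(index)
        return selected

    def broadcast(self, word, inboxes, tag=None):
        """Steer `word` to the inbox of every selected PME."""
        chosen = self.targets(tag)
        for index in chosen:
            inboxes[index].append(word)
        return chosen


if __name__ == "__main__":
    bci = Bci()
    bci.enable_mask = 0b00001111                  # only PMEs 0..3 enabled
    bci.tags = [1, 1, 2, 2, 1, 1, 2, 2]
    inboxes = [[] for _ in range(8)]
    print(bci.broadcast(0x55, inboxes, tag=1))    # enabled PMEs tagged 1
```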

Each PME is capable of operating in SIMD or in MIMD mode in our preferred embodiment. In SIMD
mode, each instruction is fed into the PME from the broadcast bus via the BCI. The BCI buffers each
broadcast data word until all of its selected nodal PMEs have used it. This synchronization provides
accommodation of the data timing dependencies associated with the execution of SIMD commands and
allows asynchronous operations to be performed by a PME. In MIMD mode, each PME executes its own
program from its own main store. The PMEs are initialized to the SIMD mode. For MIMD operations, the
external controller normally broadcasts the program to each of the PMEs while they are in SIMD mode, and
then commands the PMEs to switch to MIMD mode and begin executing. Masking/tagging the broadcast
information allows different sets of PMEs to contain different MIMD programs, and/or selected sets of PMEs
to operate in MIMD mode while other sets of PMEs execute in SIMD mode. In various software clusters or
partitions these separate functions can operate independently of the actions in other clusters or partitions.
The operation of the Instruction Set Architecture (ISA) of the PME will vary slightly depending on whether
the PME is in the SIMD or MIMD mode. Most ISA instructions operate identically regardless of mode.
However, since the PME in SIMD mode does not perform branching or other control functions, some code
points dedicated to those MIMD instructions are reinterpreted in SIMD mode to allow the PME to perform
special operations such as searching main memory for a match to a broadcast data value or switching to
MIMD mode. This further extends system flexibility of an array.

PME Architecture



Basically, our preferred architecture comprises a PME which has a 16 bit wide data flow, 32K of 16 bit
memory, specialized I/O ports and I/O switching paths, plus the necessary control logic to permit each PME
to fetch, decode and execute the 16 bit instruction set provided by our instruction set architecture (ISA).
The preferred PME performs the functions of a virtual router, and thus performs both the processing
functions and data router functions. The memory organization allows, by cross addressing of memory
between PMEs, access to a large random access memory, and direct memory for the PME. The individual
PME memory can be all local, or divided into local and shared global areas programmatically. Specialized
controls and capabilities which we describe permit rapid task switching and retention of program state
information at each of the PME's interrupt execution levels. Although some of the capabilities we provide
have existed in other processors, their application for management of interprocessor I/O is unique to
massively parallel machines. An example is the integration of the message router function into the PME itself.
This eliminates specialized router chips or development of specialized VLSI routers. We also recognize that
in some instances one could distribute the functions we provide on a single chip onto several chips
interconnected by metallization layers or otherwise and accomplish improvements to massively parallel
machines. Further, as our architecture is scalable from a single node to massively parallel supercomputer
level machines, it is possible to utilize some of our concepts at different levels. As we will illustrate, for
example, our PME data flow is very powerful, and yet operates to make the scalable design effective.

The PME processing memory element develops, for each of the multiple PMEs of a node, a fully
distributed architecture. Every PME will be comprised of processing capability with 16 bit data flow, 64K
bytes of local storage, store and forward/circuit switch logic, PME to PME communication, SIMD/MIMD
switching capabilities, programmable routing, and dedicated floating point assist logic. These functions can
be independently operated by the PME and integrated with other PMEs within the same chip to minimize
chip crossing delays. Referring to FIGURES 7 and 8 we illustrate the PME dataflow. The PME consists of
16 bit wide dataflow 425, 435, 445, 455, 465, 32K by 16 bit memory 420, specialized I/O ports 400, 410,
460, 460 and I/O switching paths 425, plus the necessary control logic to permit the PME to fetch, decode
and execute a 16 bit reduced instruction set 430, 440, 450, 480. The special logic also permits the PME to
perform as both the processing unit 460 and data router. Specialized controls 405, 406, 407, 408 and
capabilities are incorporated to permit rapid task switching and retention of program state information at
each of the PME's interrupt execution levels. Such capabilities have been included in other processors;
however, their application specifically for management of interprocessor I/O is unique in massively parallel
machines. Specifically, it permits the integration of the router function into the PME without requiring
specialized chips or VLSI development macros.


16 bit internal data flow and control 

The major parts of the internal data flow of the processing element are shown in FIGURE 7.
This processing element has a full 16 bit internal data flow 425, 435, 445, 455, 465. The important
paths of the internal data flow use 12 nanosecond hard registers such as the OP register 450, M register
440, WR register 470, and the program counter PC register 430. These registers feed the fully distributed
ALU 490 and I/O router registers and logic 405, 406, 407, 408 for all operations. With current VLSI
technology, the processor can execute memory operations and instruction steps at 25 MHz, and it can build
the important elements, OP register 450, M register 440, WR register 470, and the PC register 430, with
12 nanosecond hard registers. Other required registers are mapped to memory locations.
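
The timing and storage figures quoted in this description can be cross-checked with a little arithmetic, sketched here in Python. The names are hypothetical, and one assumption is made that the text does not state outright: that each 9 bit wide DRAM macro word holds 8 data bits plus one parity bit.

```python
# Figures quoted in the description; the names are illustrative only.
CLOCK_HZ = 25_000_000          # instruction step rate
DRAM_ACCESS_NS = 35            # "35 ns or less" DRAM access time
HARD_REGISTER_NS = 12          # hard register delay
MACROS_PER_CHIP = 16
MACROS_PER_PME = 2
WORDS_PER_MACRO = 32 * 1024
BITS_PER_MACRO_WORD = 9
DATA_BITS_PER_MACRO_WORD = 8   # assumption: the ninth bit is parity


def cycle_time_ns(clock_hz: int = CLOCK_HZ) -> float:
    """Length of one processor cycle in nanoseconds."""
    return 1e9 / clock_hz


def fits_in_one_cycle(delay_ns: float, clock_hz: int = CLOCK_HZ) -> bool:
    """True when an operation of `delay_ns` completes within one cycle."""
    return delay_ns <= cycle_time_ns(clock_hz)


def chip_memory_bits() -> int:
    """Raw DRAM bits on one chip, parity included."""
    return MACROS_PER_CHIP * WORDS_PER_MACRO * BITS_PER_MACRO_WORD


def pme_main_store_bytes() -> int:
    """Data bytes available to one PME from its two macros."""
    return MACROS_PER_PME * WORDS_PER_MACRO * DATA_BITS_PER_MACRO_WORD // 8


if __name__ == "__main__":
    print(cycle_time_ns(), "ns per cycle")
    print(fits_in_one_cycle(DRAM_ACCESS_NS), fits_in_one_cycle(HARD_REGISTER_NS))
    print(chip_memory_bits(), "bits per chip")
    print(pme_main_store_bytes(), "bytes per PME")
```

Under these assumptions a 25 MHz step rate gives a 40 ns cycle, which accommodates the 35 ns DRAM access, and two macros per PME give the 64 KByte (32K by 16 bit) main store.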

As seen in FIGURE 8 the internal data flow of the PME has its 32K by 16 bit main store in the form of
two DRAM macros. The remainder of the data flow consists of CMOS gate array macros. All of the memory
can be formed with the logic with low power CMOS DRAM deposition techniques to form a very large
scale integrated PME chip node. The PME is replicated 8 times in the preferred embodiment of the node
chip. The PME data flow consists of a 16 word by 16 bit general register stack, a multi-function
arithmetic/logic unit (ALU), working registers to buffer memory addresses, memory output registers, ALU
output registers, operation/command and I/O output registers, and multiplexors to select inputs to the ALU
and registers. Current CMOS VLSI technology for 4 Mbit DRAM memory with our logic permits a PME to
execute instruction steps at 25 MHz. We are providing the OP register, the M register, the WR register and
the general register stack with 12 nanosecond hard registers. Other required registers are mapped to
memory locations within a PME.

The PME data flow is designed as a 16 bit integer arithmetic processor. Special multiplexer paths have
been added to optimize subroutine emulation of n x 16 bit floating point operations (n >= 1). The 16 bit data
flow permits effective emulation of floating point operations. Specific paths within the data flow have been
included to permit floating point operations in as little as 10 cycles. The ISA includes special code points to
permit subroutines for extended (longer than 16-bit) operand operations. The subsequent floating point
performance is approximately one twentieth the fixed point performance. This performance is
adequate to eliminate the need for special floating point chips augmenting the PME as is characteristic of
other massively parallel machines. Some other processors do include special floating point processors
on the same chip as a single processor (see FIGURE 1). We can enable special floating point hardware
processors on the same chip with our PMEs, but we would then need more cells than are required for the
preferred embodiment. For floating point operations, see also the concurrently filed floating point
application referenced above for improvements to the IEEE standard.
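
The idea of handling operands longer than 16 bits by subroutine on a 16 bit data flow can be illustrated with a generic multi-word addition, sketched here in Python. This is not the PME's instruction sequence or its floating point method; it is ordinary carry-propagating integer addition, and the helper names are hypothetical.

```python
MASK16 = 0xFFFF


def to_words(value, word_count):
    """Split a non-negative integer into 16 bit words, least significant first."""
    return [(value >> (16 * i)) & MASK16 for i in range(word_count)]


def from_words(words):
    """Reassemble an integer from 16 bit words, least significant first."""
    return sum(word << (16 * i) for i, word in enumerate(words))


def add_extended(a_words, b_words):
    """Add two equal-length multi-word operands one 16 bit word at a time.

    Returns the result words and the final carry out, mimicking how a
    16 bit ALU would process an n x 16 bit operand over n steps.
    """
    if len(a_words) != len(b_words):
        raise ValueError("operands must have the same number of words")
    carry = 0
    result = []
    for a, b in zip(a_words, b_words):
        total = a + b + carry
        result.append(total & MASK16)
        carry = total >> 16
    return result, carry


if __name__ == "__main__":
    words, carry = add_extended(to_words(0x0001FFFF, 2), to_words(0x00000001, 2))
    print(hex(from_words(words)), carry)
```

Each loop iteration corresponds to one 16 bit ALU step, which is why extended operations cost several cycles where a 16 bit operation costs one.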

The approach developed is well poised to take advantage of the normal increases in VLSI technology
performance. As circuit size shrinks and greater packaging density becomes possible, data flow
elements like base and index registers, currently mapped to memory, could be moved to hardware.
Likewise, floating point sub-steps are accelerated with additional hardware, which we will prefer for the
developing CMOS DRAM technology as reliable higher density levels are achieved. Very importantly, this
hardware alternative does not affect any software.

The PME is initialized to SIMD mode with interrupts disabled. Commands are fed into the PME
operation decode buffer from the BCI. Each time an instruction operation completes, the PME requests a
new command from the BCI. In a similar manner, immediate data is requested from the BCI at the
appropriate point in the instruction execution cycle. Most instructions of the ISA operate identically whether
the PME is in SIMD mode or in MIMD mode, with the exception that SIMD instructions and immediate
data are taken from the BCI; in MIMD mode the PME maintains a program counter (PC) and uses that as
the address within its own memory to fetch a 16 bit instruction. Instructions such as "Branch" which
explicitly address the program counter have no meaning in SIMD mode and some of those code points are
re-interpreted to perform special SIMD functions such as comparing immediate data against an area of main store.
The PME instruction decode logic permits either SIMD or MIMD operational modes, and PMEs can
transition between modes dynamically. In SIMD mode the PME receives decoded instruction information
and executes that data in the next clock cycle. In MIMD mode the PME maintains a program counter PC
address and uses that as the address within its own memory to fetch a 16 bit instruction. Instruction decode
and execution proceed as in most any other RISC type machine. A PME in SIMD mode enters MIMD
mode when given the information that represents a decode branch. A PME in MIMD mode enters the SIMD
mode upon executing a specific instruction for the transition.

When PMEs transition dynamically between SIMD and MIMD modes, MIMD mode is entered by
executing a SIMD "write control register" instruction with the appropriate control bit set to "1". At the
completion of the SIMD instruction, the PME enters the MIMD mode, enables interrupts, and begins
fetching and executing its MIMD instructions from the main store location specified by its general register.
Interrupts are masked or unmasked depending on the state of interrupt masks when the MIMD control
bit is set. The PME returns to SIMD mode either by being externally reinitialized or by executing a MIMD
"write control register" instruction with the appropriate control bit set to zero.
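
The mode transitions described above can be pictured with a small software model. This is only a sketch under stated assumptions: the class `PME`, the attribute `start_register` (standing in for the general register that holds the MIMD entry address), and the method names are hypothetical and are not part of the ISA given in the Tables.

```python
class PME:
    """Toy model of SIMD/MIMD mode switching (names are illustrative only)."""

    def __init__(self, memory):
        self.memory = list(memory)       # local main store, one word per entry
        self.mode = "SIMD"               # the PME is initialized to SIMD mode
        self.interrupts_enabled = False  # ...with interrupts disabled
        self.pc = 0                      # program counter, used only in MIMD mode
        self.start_register = 0          # hypothetical: register holding the MIMD entry address

    def write_control_register(self, mimd_bit):
        """Model of the 'write control register' instruction."""
        if mimd_bit:
            # Control bit set to 1: enter MIMD, enable interrupts, and start
            # fetching from the location named by the general register.
            self.mode = "MIMD"
            self.interrupts_enabled = True
            self.pc = self.start_register
        else:
            # Control bit set to zero: return to SIMD. The text does not say
            # what happens to the interrupt enable here, so it is left alone.
            self.mode = "SIMD"

    def next_instruction(self, bci_word=None):
        """Return the next instruction word for the current mode."""
        if self.mode == "SIMD":
            if bci_word is None:
                raise ValueError("SIMD mode needs a word supplied by the BCI")
            return bci_word              # instruction comes from the BCI
        word = self.memory[self.pc]      # instruction comes from local memory
        self.pc += 1
        return word
```

In this model the same PME object accepts broadcast words while in SIMD mode and runs from its own memory after the control bit is set, which is the duality the text describes.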

Data communication paths and control

Returning to FIGURE 7 it will be seen that each PME has 3 input ports 400, and 3 output ports 480
for on-chip communication plus 1 I/O port 410, 490 for off chip communications. Existing
technology, rather than the processor idea, requires that the off-chip port be byte wide half duplex. Input
ports are connected such that data may be routed from input to memory, or from input AR register 406 to
an output register via a direct data path. Memory would be the data sink for messages targeted at
the PME or for messages that were moved in "store and forward" mode. Messages that do not target the
particular PME are sent directly to the required output port, providing a "circuit switched" mode, when
blocking has not occurred. The PME software is charged with performing the routing and determining the
selected transmission mode. This makes dynamically selecting between "circuit switched" and "store and
forward" modes possible. This is also another unique characteristic of the PME design.

Thus, our preferred node has 8 PMEs and each PME has 4 output ports (Left, Right, Vertical, and
External). Three of the input ports and three of the output ports are 16-bit wide full duplex point-to-point
connections to the other PMEs on the chip. The fourth ports are combined in the preferred embodiment to
provide a half duplex point-to-point connection to an off-chip PME. Due to pin and power constraints that we
have imposed to make use of the less dense CMOS we employ, the actual off-chip interface is a byte-wide
path which is used to multiplex two halves of the inter-PME data word. With special "zipper" circuitry which
provides a dynamic, temporary logical breaking of internodal rings to allow data to enter or leave an array,
these external PME ports provide the APAP external I/O array function.

For data routed to the PME memory, normal DMA is supported such that the PME instruction stream
must become involved in the processing only at the beginning and end of messages. Finally, data that is
being circuit switched to an internal output port is forwarded without clocking. This permits single cycle
data transfers within a chip and detects when chip crossings will occur such that the fastest but still reliable
communication can occur. Fast forwarding utilizes forward data paths and backward control paths, all
operating in transparent mode. In essence, a source looks through several stages to see the
acknowledgments from the PME performing a DMA or off-chip transfer.

As seen by FIGURES 7 and 8, data on a PME input port may be destined for the local PME, or for a
PME further down the ring. Data destined for a PME further down the ring may be stored in the local PME
main memory and then forwarded by the local PME towards the target PME (store and forward), or the local
input port may be logically connected to a particular local output port (circuit switched) such that the data
passes "transparently" through the local PME on its way to the target PME. Local PME software
dynamically controls whether or not the local PME is in "store and forward" mode or in "circuit switched"
mode for any of the four inputs and four outputs. In circuit switched mode, the PME concurrently processes
all functions except the I/O associated with the circuit switch; in store and forward mode the PME suspends
all other processing functions to begin the I/O forwarding process.

While data may be stored externally of the array in a shared memory or DASD (with external controller),
it may be stored anywhere in the memories provided by PMEs. Input data destined for the local PME or
buffered in the local PME during "store and forward" operations is placed into local PME main memory via
a direct memory access (address) mechanism associated with each of the input ports. A program interrupt
is available to indicate that a message has been loaded into PME main memory. The local PME program
interprets header data to determine if the data destined for the local PME is a control message which can
be used to set up a circuit-switched path to another PME, or whether it is a message to be forwarded to
another PME. Circuit switched paths are controlled by local PME software. A circuit switched path logically
couples a PME input path directly to an output path without passing through any intervening buffer storage.
Since the output paths between PMEs on the same chip have no intervening buffer storage, data can enter
the chip, pass through a number of PMEs on the chip and be loaded into a target PME's main memory in a
single clock cycle! Only when a circuit switch combination leaves the chip is an intermediate buffer storage
required. This reduces the effective diameter of the APAP array by the number of unbuffered circuit switched
paths. As a result data can be sent from a PME to a target PME in as few clock cycles as there are
intervening chips, regardless of the number of PMEs in the path. This kind of routing can be compared to a
switched environment in which at each node cycles are required to carry data on to the next node. Each of
our nodes has 8 PMEs!
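
The effect of unbuffered on-chip paths on latency can be illustrated with a rough count. The sketch below rests on an assumption of ours, not a figure from the text: one clock cycle is charged for the final on-chip delivery and one more for every buffered chip crossing. The function names and the path representation (a list giving the chip of each PME visited) are hypothetical.

```python
def circuit_switched_cycles(chip_of_each_pme):
    """Estimate clock cycles for a circuit switched transfer.

    chip_of_each_pme lists, in path order, the chip that holds each PME the
    data passes through (source first, target last). Assumed cost model:
    hops inside a chip are unbuffered and free, each chip crossing needs one
    buffered cycle, and the final delivery takes one cycle.
    """
    if not chip_of_each_pme:
        raise ValueError("path must contain at least one PME")
    crossings = sum(
        1
        for prev, cur in zip(chip_of_each_pme, chip_of_each_pme[1:])
        if prev != cur
    )
    return 1 + crossings


def pme_hops(chip_of_each_pme):
    """Number of PME-to-PME hops, the cost driver if every node buffered data."""
    return len(chip_of_each_pme) - 1
```

Under this model a path through four PMEs on one chip costs a single cycle, while the hop count, which is what a conventional switched network would pay for, is three.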

Memory and Interrupt Levels 

The PME contains 32K by 16 bit 420 dedicated storage words. This storage is completely general and
can contain both data and program. In SIMD operations all of memory could be data as is characteristic of
other SIMD massively parallel machines. In MIMD modes, the memory is quite normal; but, unlike most
massively parallel MIMD machines, the memory is on the same chip with the PME and is thus immediately
available. This then eliminates the need for caching and cache coherency techniques characteristic of
other massively parallel MIMD machines. In the case for instance of Inmos's chip, only 4K resides on the
chip, and an external memory interface bus and pins are required. These are eliminated by us.

Low order storage locations are used to provide a set of general purpose registers for each interrupt
level. The particular ISA developed for the PME uses short address fields for these register references.
Interrupts are utilized to manage processing, I/O activities and S/W specified functions (i.e., a PME in
normal processing will switch to an interrupt level when incoming I/O initiates). If the level is not masked,
the switch is made by changing a pointer in H/W such that registers are accessed from a new section of
low order memory and by swapping a single PC value. This technique permits fast level switching and
permits S/W to avoid the normal register save operations and also to save status within the interrupt level
registers.

The PME processor operates on one of eight program interrupt levels. Memory addressing permits a
partitioning of the lower 576 words of memory among the eight levels of interrupts. 64 of these 576 words
of memory are directly addressable by programs executing on any of the eight levels. The other 512 words
are partitioned into eight 64 word segments. Each 64 word segment is directly accessible only by programs
executing on its associated interrupt level. Indirect addressing techniques are employed for allowing all
programs to access all 32K words of PME memory.
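
The partitioning of the lower 576 words can be sketched as an address calculation. The text gives the sizes (64 shared words plus eight private 64 word segments) but not the physical layout or the encoding of the short address field, so the ordering used here (shared block first, then the segments in level order) and the 7 bit short address are assumptions made only for illustration.

```python
SHARED_WORDS = 64    # directly addressable from every interrupt level
SEGMENT_WORDS = 64   # private to a single interrupt level
LEVELS = 8           # eight program interrupt levels


def resolve_register_address(level, short_addr):
    """Map a short register address to a word address in low memory.

    Assumed encoding: short addresses 0-63 select the shared block and
    64-127 select the segment owned by the current interrupt level.
    """
    if not 0 <= level < LEVELS:
        raise ValueError("interrupt level must be 0-7")
    if not 0 <= short_addr < SHARED_WORDS + SEGMENT_WORDS:
        raise ValueError("short address must be 0-127")
    if short_addr < SHARED_WORDS:
        return short_addr
    # Switching interrupt level only changes the base used here, which is why
    # no register save is needed on a level switch.
    return SHARED_WORDS + level * SEGMENT_WORDS + (short_addr - SHARED_WORDS)
```

With this layout the highest address reachable through a register reference is 575, matching the 576 words named in the text.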

The interrupt levels are assigned to the input ports, the BCI, and to error handling. There is a "normal"
level but there is no "privileged" nor "supervisor" level. A program interrupt causes a context switch in
which the contents of the PC program counter, status/control register, and selected general registers are
stored in specified main memory locations and new values for these registers are fetched from other
specified main memory locations.

The PME data flow discussed with reference to FIGURES 7 and 8 may be amplified by reference to
the additional sections below. In a complex system, the PME data flow uses the combination of the chip as
an array node with memory, processor and I/O which sends and receives messages with the BCI that we
replicate as the basic building block of an MMP built with our APAP. The MMP can handle many word
lengths.

PME Multiple Length Data Flow Processing

The system we describe can perform the operations handled by current processors with the data flow in
the PME, which is 16 bits wide. This is done by performing operations on data lengths which are multiples
of 16 bits. This is accomplished by doing the operation in 16 bit pieces. One may need to know the result
of each piece (i.e. was it zero, was there a carry out of the high bit of the sum).

Adding two numbers of 48 bits can be an example of data flow. In this example two numbers of 48 bits
(a(0-47) and b(0-47)) are added by performing the following in the hardware:

a(32-47) + b(32-47) -> ans(32-47)                  step one

1) save the carry out of high bit of sum

2) remember if partial result was zero or non-zero

a(16-31) + b(16-31) + saved carry -> ans(16-31)    step two

1) save the carry out of high bit of sum

2) remember if partial result was zero or non-zero from this result and from previous step; if both are
zero remember zero; if either is non-zero remember non-zero

a(0-15) + b(0-15) + saved carry -> ans(0-15)       final step

1) if this piece is zero and last piece was zero, ans is zero

2) if this piece is zero and last piece was non-zero, ans is non-zero

3) if this piece is non-zero, ans is positive or negative based on sign of sum (assuming no overflow)

4) if carry into sign of ans is not equal to carry out of sign of ans, ans has wrong sign and result is
an overflow (cannot properly represent in the available bits)

The length can be extended by repeating the second step in the middle as many times as required. If the
length were 32 the second step would not be performed. If the length were greater than 48, step two would
be done multiple times. If the length were just 16, the operation in step one with conditions 3 and 4 of the
final step would be done. Extending the length of the operands to multiple lengths of the data flow is a
technique having a consequence that the instruction usually takes longer to execute for a narrower data
flow. That is, a 32 bit add on a 32 bit data flow only takes one pass through the adder logic while the same
add on a 16 bit data flow takes two passes through the adder logic.
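
The piecewise addition just described can be written out as a short routine. This is a software sketch of the stepwise procedure, not the hardware data flow itself; the function name `multiword_add` and the returned condition strings are our own choices. Words are listed most significant first, so index 0 holds bits 0-15 in the numbering used above.

```python
MASK16 = 0xFFFF


def multiword_add(a_words, b_words):
    """Add two two's complement operands held as lists of 16 bit words.

    Returns (result_words, condition) where condition is one of
    "zero", "positive", "negative" or "overflow".
    """
    if len(a_words) != len(b_words) or not a_words:
        raise ValueError("operands must be non-empty and of equal length")
    result = [0] * len(a_words)
    carry = 0            # saved carry out of the previous piece
    all_zero = True      # remembered zero/non-zero state across pieces
    carry_into_sign = 0
    # Step one, the repeated middle step, and the final step are the same
    # loop body, run from the least significant piece to the most significant.
    for i in range(len(a_words) - 1, -1, -1):
        a = a_words[i] & MASK16
        b = b_words[i] & MASK16
        if i == 0:
            # Carry into the sign bit is the carry out of the low 15 bits
            # of the most significant piece.
            carry_into_sign = ((a & 0x7FFF) + (b & 0x7FFF) + carry) >> 15
        total = a + b + carry
        result[i] = total & MASK16
        carry = total >> 16
        all_zero = all_zero and result[i] == 0
    if carry_into_sign != carry:
        condition = "overflow"   # condition 4 of the final step
    elif all_zero:
        condition = "zero"       # condition 1
    elif result[0] & 0x8000:
        condition = "negative"   # conditions 2 and 3: sign taken from the sum
    else:
        condition = "positive"
    return result, condition
```

For a 48 bit operand the loop runs three times, once per step in the example; a 32 bit operand skips the middle step and a 16 bit operand runs only the final pass, as the text notes.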

What we have done that is interesting is that in the current implementation of the machine we have
single instructions which can perform adds/subtracts/compares/moves on operands of length 1 to 8 words
(length is defined as part of the instruction). Individual instructions available to the programmer perform the
same kind of operations as shown above for steps one, two, and final (except to the programmer the
operand length is longer, i.e. 16 to 128 bits). At the bare bones hardware level, we are working on 16 bits at
a time, but the programmer thinks s/he's doing 16 to 128 bits at a time.

By using combinations of these instructions, operands of any length can be manipulated by the
programmer, i.e. two instructions can be used to add two numbers of up to 256 bits in length.

PME Processor

Our PME processor is different from modern microprocessors currently utilized for MPP applications.
The processor portion differences include:

1. The processor is a fully programmable hardwired computer (see the ISA description for an instruction
set overview) with:
o it has a complete memory module shown in the upper right corner (see FIGURE 8),
o it has hardware registers with controls required to emulate separate register sets for each interrupt
level (shown in upper left corner),
o its ALU has the required registers and controls to permit effective multi-cycle integer and floating
point arithmetic,
o it has I/O switching paths needed to support packet or circuit switched data movement between
PMEs interconnected by point-to-point links shown in the lower right corner.

2. This is our minimalist approach to processor design permitting eight replications of the PME per chip
with the CMOS DRAM technology.

3. This processor portion of the PME provides about the minimum dataflow width required to encode a
fast Instruction Set Architecture (ISA) (see Tables) which is required to permit effective MIMD or SIMD
operation of our MMP.

PME Resident Software

The PME is the smallest element of the APAP capable of executing a stored program. It can execute a
program which is resident in some external control element and fed to it by the broadcast and control
interface (BCI) in SIMD mode, or it can execute a program which is resident in its own main memory (MIMD
mode). It can dynamically switch between SIMD mode and MIMD mode, representing SIMD/MIMD mode
duality functions, and also the system can execute these dualities at the same time (SIMIMD mode). A
particular PME can make this dynamic switch by merely setting or resetting a bit in a control register. Since
SIMD PME software is actually resident in the external control element, further discussion of this may be
found in our discussion of the Array Director and in related applications.

MIMD software is stored into the PME main memory while the PME is in SIMD mode. This is feasible
since many of the PMEs will contain identical programs because they will be processing similar data in an
asynchronous manner. Here we would note that these programs are not fixed, but they can be modified by
loading the MIMD program from an external source during processing of other operations.

Since the PME instruction set architecture represented in the Tables is that of a microcomputer, there
are few restrictions with this architecture on the functions which the PME can execute. Essentially each
PME can function like a RISC microprocessor. Typical MIMD PME software routines are listed below:
1. Basic control programs for dispatching and prioritizing the various resident routines.
2. Communication software to pass data and control messages among the PMEs. This software would
determine when a particular PME would go into/out of the "circuit switched" mode. It performs a "store
and forward" function as appropriate. It also initiates, sends, receives, and terminates messages between
its own main memory and that of another PME.
3. Interrupt handling software completes the context switch, and responds to an event which has caused
the interrupt. These can include fail-safe routines and rerouting or reassignment of PMEs to an array.
4. Routines which implement the extended Instruction Set Architecture which we describe below. These
routines perform macro level instructions such as extended precision fixed point arithmetic, floating point
arithmetic, vector arithmetic, and the like. This permits not only complex math to be handled but image
processing activities for display of image data in multiple dimensions (2d and 3d images) and multimedia
processes.
5. Standard mathematical library functions can be included. These can preferably include LINPACK and
VPSS routines. Since each PME may be operating on a different element of a vector or matrix, the
various PMEs may all be executing different routines or differing portions of the same matrix at one time.
6. Specialized routines for performing scatter/gather or sorting functions which take advantage of the
APAP nodal interconnection structure and permit dynamic multi-dimensional routing are provided. These
routines effectively take advantage of some amount of synchronization provided among the various
PMEs, while permitting asynchronous operations to continue. For sorts, there are sort routines. The
APAP is well suited to a Batcher Sort, because that sort requires extensive calculations to determine
particular elements to compare versus very short comparison cycles. Program synchronization is
managed by the I/O statements. The program allows multiple data elements per PME and very large parallel
sorts in quite a straightforward manner.

While each PME has its own resident software, the systems made from these microcomputers can
execute higher level language processes designed for scalar and parallel machines. Thus the system can
execute application programs which have been written for UNIX machines, or those of other operating
systems, in high level languages such as Fortran, C, C++, FortranD and so on.

It may be an interesting footnote that our processor concepts use an approach to processor design
which is quite old. Perhaps thirty years of use of a similar ISA design has occurred in IBM's military
processors. We have been the first to recognize that this kind of design can be used to advantage in
leapfrogging the dead ended current modern microprocessor design when combined with our total PME design
to move the technology to a new path for use in the next century.

Although the processor's design characteristics are quite different from other modern microprocessors,
similar gate constrained military and aerospace processors have used the design since the '60s. It provides
sufficient instructions and registers for straightforward compiler development, and both general and signal
processing applications are effectively running on this design. Our design has minimal gate requirements,
and IBM has implemented some similar concepts for years when embedded chip designs were needed
for general purpose processing. Our adoption now of parts of the older ISA design permits use of many utilities
and other software vehicles which will enable adoption of our systems at a rapid rate because of the
existing base and the knowledge that many programmers have of the design concepts.



PME I/O 



The PME will interface to the broadcast and control interface (BCI) bus by either reading data from the
bus into the ALU via the path labeled BCI in FIGURE 8 or by fetching instructions directly from the bus
into the decode logic (not shown). The PME powers up in SIMD mode and in that mode reads, decodes and
executes instructions until encountering a branch. A broadcast command in SIMD mode causes the transition
to MIMD with instructions fetched locally. A broadcast PME instruction "INTERNAL DIOW" reverts the state.

PME I/O can be sending data, passing data or receiving data. When sending data, the PME sets the
CTL register to connect XMIT to either L, R, V, or X. H/W services then pass a block of data from memory
to the target via the ALU multiplexer and the XMIT register. This processing interleaves with normal
instruction operation. Depending upon application requirements, the block of data transmitted can contain
raw data for a predefined PME and/or commands to establish paths. A PME that receives data will store
input to memory and interrupt the active lower level processing. The interpretation task at the interrupt level
can use the interrupt event to do task synchronization or to initiate a transparent I/O operation (when data is
addressed elsewhere). During the transparent I/O operation, the PME is free to continue execution. Its CTL
register makes it busy. Data will pass through it without gating, and it will remain in that mode until an
instruction changes the mode. While a PME is passing data it cannot be a data source, but it
can be a data sink.



PME Broadcast Section 



This is a chip-to-common control device interface. It can be used by the device that serves as a
controller to command I/O, or to test and diagnose the complete chip.

Input is word sequences (either instruction or data) that are available to subsets of PMEs. Associated
with each word is a code indicating which PMEs will use the word. The BCI will use the word both to
limit a PME's access to the interface and to assure that all required PMEs receive data. This serves to
adjust the BCI to the asynchronous PME operations. (Even when in SIMD mode PMEs are asynchronous
due to I/O and interrupt processing.) The mechanism permits PMEs to be formed into groups which are
controlled by interleaved sets of command/data words received over the BCI.

Besides delivering data to the PMEs, the BCI accepts request codes from the PMEs, combines them,
and transmits the integrated request. This mechanism can be used in several ways. MIMD processes can
be initiated in a group of processors that all end with an output signal. The "AND" of signals triggers the
controller to initiate a new process. Applications, in many cases, will not be able to load all required S/W in
PME memory. An encoded request to the controller will be used to acquire a S/W overlay from perhaps the
host's storage system.

The controller uses a serial scan loop through many chips to acquire information on individual chips or
PMEs. These loops initially interface to the BCI but can in the BCI be bridged to individual PMEs.

Broadcast and Control Interface

The BCI broadcast and control interface provided on each chip provides a parallel input interface such
that data or instructions can be sent to the node. Incoming data is tagged with a subset identifier and the
BCI includes the functions required to assure that all PMEs within the node, operating within the subset, are
provided the data or instructions. The parallel interface of the BCI serves both as a port to permit data to be
broadcast to all PMEs and as the instruction interface during SIMD operations. Satisfying both requirements
plus extending those requirements to supporting subset operations is completely unique to this design
approach.

Our BCI parallel input interface permits data or instructions to be sent from a control element that is
external to the node. The BCI contains "group assignment" registers (see the grouping concepts in our
above application entitled GROUPING OF SIMD PICKETS) which are associated with each of the PMEs.
Incoming data words are tagged with a group identifier and the BCI includes the functions required to
assure that all PMEs within the node which are assigned to the designated group are provided the data or
instructions. The parallel interface of the BCI serves as both a port to permit data to be broadcast to the
PMEs during MIMD operations, and as the PME instruction/immediate operand interface during SIMD
operations.
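
The group-tagged delivery performed by the BCI can be sketched as a simple filter. This is an illustrative model only: the function `deliver_broadcast`, the dictionary standing in for the group assignment registers, and the tuple format of the tagged words are all hypothetical and are not taken from the design.

```python
def deliver_broadcast(group_assignment, tagged_words):
    """Route tagged words to the PMEs assigned to the matching group.

    group_assignment maps a PME identifier to its group identifier (a stand-in
    for the BCI group assignment registers). tagged_words is an iterable of
    (group_id, word) pairs as they arrive on the parallel interface.
    Returns a dict mapping each PME identifier to the words it received.
    """
    inbox = {pme: [] for pme in group_assignment}
    for group_id, word in tagged_words:
        for pme, assigned in group_assignment.items():
            if assigned == group_id:
                inbox[pme].append(word)   # only members of the group see it
    return inbox
```

Interleaving words for different groups on one input stream, as in this model, is what lets a single controller drive several independently operating subsets of PMEs.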

The BCI also provides two serial interfaces. The high speed serial port will provide each PME with the
capability to output a limited amount of status information. That data is intended to:

1. signal our Array Director 610 when the PME, e.g. 500, has data that needs to be read or that the PME
has completed some operation. It passes a message to the external control element represented by the
Array Director.

2. provide activity status such that external test and monitor elements can illustrate the status of the
entire system.

The standard serial port permits the external control element means for selectively accessing a specific
PME for monitor and control purposes. Data passed over this interface can direct data from the BCI parallel
interface to a particular PME register or can select data from a particular PME register and route it to the
high speed serial port. These control points allow the external control element to monitor and control
individual PMEs during initial power up and diagnostic phases. It permits the Array Director to input control data
so as to direct the port to particular PME and node internal registers and access points. These registers
provide paths such that a PME of a node can output data to the Array Director, and these registers permit the
Array Director to input data to the units during initial power up and diagnostic phases. Data input to an access
point can be used to control test and diagnostic operations, i.e. perform single instruction step, stop on
compare, break points, etc.

Node Topology

Our modified hypercube topology connection is most useful for massively parallel systems, while other
less powerful connections can be used with our basic PMEs. Within our initial embodiment of the VLSI chip
are eight PMEs with fully distributed PME internal hardware connections. The internal PME to PME chip
configuration is two rings of four PMEs, with each PME also having one connection to a PME in the other
ring. For the case of eight PMEs in a VLSI chip this is a three dimensional binary hypercube; however, our
approach in general does not use hypercube organizations within the chip. Each PME also provides for the
escape of one bus. In the initial embodiment the escaped buses from one ring are called +X, +Y, +W and
+Z, while those from the other ring are labeled similarly except - (minus).

The specific chip organization is referred to as the node of the array, and a node can be in a cluster of
the array. The nodes are connected using +-X and +-Y into an array, to create a cluster. The
dimensionality of the array is arbitrary, and in general greater than two, which is the condition required for
developing a binary hypercube. The clusters are then connected using +-W, +-Z into an array of clusters.
Again, the dimensionality of the array is arbitrary. The result is the 4-dimensional hypercube of nodes. The
extension to a 5-dimensional hypercube requires the usage of a 10 PME node and uses the additional two
buses, say +-E1, to connect the 4-dimensional hypercube into a vector of hypercubes. We have then shown
the pattern of extension to either odd or even radix hypercubes. This modified topology limits the cluster to
cluster wiring while supporting the advantages of the hypercube connection.

Our wireability and topology configuration for massively parallel machines has advantages in keeping
the X and Y dimensions within our cluster level of packaging, and in distributing the W and Z bus
connections to all the neighboring clusters. After implementing the techniques described, the product will be
wireable and manufacturable while maintaining the inherent characteristics of the topology defined.

The node consists of k*n PMEs plus the Broadcast and Control Interface (BCI) section. Here "n"
represents the number of dimensions or rings which characterize the modified hypercube, while "k"
represents the number of rings that characterize the node. Although a node can contain k rings, it is a
characteristic of the concept that only two of those rings may provide escape buses. "n" and "k" are limited
in our preferred embodiment, because of the physical chip package, to n=4 and k=2. This limitation is a
physical one, and different chip sets will allow other and increased dimensionality in the array. In addition
to being a part of the physical chip package, it is our preferred embodiment to provide a grouping of PMEs
that interconnect a set of rings in a modified hypercube. Each node will have 8 PMEs with their PME
architecture and ability to perform processing and data router functions. As such, n is the dimensionality of
the modified hypercube (see following section), i.e. a 4d modified hypercube's node element would be 8
PMEs while a 5d modified hypercube's node would be 10 PMEs. For visualization of nodes which we can
employ, refer to FIGURE 6, as well as FIGURES 9 and 10 for visualization of interconnections, and see
the figures for a block diagram of each node and for elaboration on possible interconnections
for an APAP.

It will be noted that the application entitled "METHOD FOR INTERCONNECTING AND SYSTEM OF
INTERCONNECTED PROCESSING ELEMENTS" of co-inventor David B. Rolfe, filed in the United States
Patent and Trademark Office on May 13, 1991, under USSN 07/698,866, described the modified hypercube
criteria which can preferably be used in connection with our APAP MMP.

That application is incorporated by reference and describes a method of interconnecting processing
elements in such a way that the number of connections per element can be balanced against the network
diameter (worst case path length). This is done by creating a topology that maintains many of the well
known and desirable topological properties of hypercubes while improving its flexibility by enumerating the
nodes of the network in number systems whose base can be varied. When using a base 2 number system
this method creates the hypercube topology. The invention has fewer interconnections than a hypercube,
has uniform connections and preserves the properties of a hypercube. These properties include: 1) large
number of alternate paths, 2) very high aggregate bandwidth, and 3) well understood and existing methods
that can be used to map other common problem topologies with the topology of the network. The result is a
generalized non-binary hypercube with less density. It will be understood that with the preference we have
given to the modified hypercube approach, in some applications a conventional hypercube can be utilized.
In connecting nodes, other approaches to a topology could be used; however, the ones we describe herein
are believed to be new and an advance, and we prefer the ones we describe.

The interconnection method for the modified hypercube topology, interconnecting a plurality of
nodes in a network of PMEs:

1. defines a set of integers e1, e2, e3, ... such that the product of all elements equals the number of PMEs
in the network, called M, while the product of all elements in the set excepting e1 and e2 is the number
of nodes, called N, and the number of elements in the set, called m, defines the dimensionality of the
network n by the relationship n = m-2.

2. addresses a PME located by a set of indexes a1, a2, ... am, where each index is the PME's position in
the equivalent level of expansion such that the index ai is in the range of zero to ei-1 for i equal to 1, 2,
... to m, by the formula

(...((a(m)*e(m-1) + a(m-1))*e(m-2) + a(m-2))* ... + a(2))*e(1) + a(1)

where the notation a(i) has the normal meaning of the i-th element in the list of elements called a, or
equivalently for e.

3. connects two PMEs (with addresses r and s) if and only if either of the following two conditions hold:
a. the integer part of r/(e1*e2) equals the integer part of s/(e1*e2) and:

1. the remainder part of r/e1 differs from the remainder part of s/e1 by 1, or

2. the remainder part of r/e2 differs from the remainder part of s/e2 by 1 or e2-1.

b. the remainder part of r/ei differs from the remainder part of s/ei for i in the range 3, 4, ... m and the
remainder part of r/e1 equals the remainder part of s/e1, which equals i minus three, and the
remainder part of r/e2 differs from the remainder part of s/e2 by e2 minus one.

As a result, the computing system nodes will form a non-binary hypercube, with the potential for being
different radix in each dimension. The node is defined as an array of PMEs which supports 2*n ports such
that the ports provided by nodes match the dimensionality requirements of the modified hypercube. If the
act of Integers e& e4. em. which deQno me spedffc extent of each cfimeneton of a particular motfifled 
hyper cube ire a* taken a* equal, say b, and if ei and *2 are taken a 1, then me prerfous formulas for 
addressability and connections reduce ta 
s l.N»b-*n 

2. addressing a PME as nurnbere representing the base b numbering system 

3. connecdng two compiling elements (f and gj 9 and only if the address of f differs from the address of 
g \n exactly one base b digit using the rule mat 0 and b-1 are separated by 1. 

4. me number of oonractaa supported by each PME Is ?h 

Which is exactly as described in the base application, with Cha number of communication buses spanning 
noo-adjacem pmes chosen as zero. 
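
The addressing formula and the reduced base-b connection rule above can be sketched in a few lines of
Python. This is only an illustrative sketch: the function names (pme_address, to_digits, are_connected,
neighbors) are hypothetical and do not come from the specification, and "differs in exactly one base b
digit" is read here as "that digit differs by 1, with 0 and b-1 treated as adjacent".

```python
from typing import List, Sequence


def pme_address(a: Sequence[int], e: Sequence[int]) -> int:
    """Mixed-radix address: a[0] is a(1) (least significant), e[0] is e(1)."""
    if len(a) != len(e):
        raise ValueError("index list and extent list must have equal length")
    addr = 0
    # Fold from the most significant index a(m) down to a(1).
    for digit, extent in zip(reversed(a), reversed(e)):
        if not 0 <= digit < extent:
            raise ValueError("each index a(i) must lie in 0 .. e(i)-1")
        addr = addr * extent + digit
    return addr


def to_digits(addr: int, b: int, n: int) -> List[int]:
    """Base-b digits of an address, least significant digit first."""
    digits = []
    for _ in range(n):
        digits.append(addr % b)
        addr //= b
    return digits


def are_connected(f: int, g: int, b: int, n: int) -> bool:
    """Reduced rule: exactly one digit differs, by 1 modulo b (0 and b-1 adjacent)."""
    pairs = [(x, y) for x, y in zip(to_digits(f, b, n), to_digits(g, b, n)) if x != y]
    if len(pairs) != 1:
        return False
    x, y = pairs[0]
    return (x - y) % b in (1, b - 1)


def neighbors(f: int, b: int, n: int) -> List[int]:
    """All addresses connected to f among the b**n addresses."""
    return [g for g in range(b ** n) if are_connected(f, g, b, n)]
```

With b = 8 and n = 4 every address has 2*n = 8 neighbors, which agrees with item 4 of the reduced
formulas.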

Intra-Node PME Interconnections:

PMEs are configured within the node as a 2 by n array. Each PME is interconnected with its three
neighbors (edges wrap only in the second dimension) using a set of input/output ports, thus providing full
duplex communication capability between PMEs. Each PME's external input and output port is connected to
node I/O pins. Input and output ports may be connected to share pins for half-duplex communication or to
separate pins for full-duplex capability. The interconnections for a 4d modified hypercube node are shown
in FIGURES 9 and 10. (Note that where n is even the node can be considered to be a 2 by 2 by n/2 array.)

FIGURE 9 illustrates the eight processing elements 500, 510, 520, 530, 540, 550, 560, 570 within
the node. The PMEs are connected in a binary hypercube communication network. This binary hypercube
displays three intra connections between PMEs (501, 511, 521, 531, 541, 551, 561, 571, 590, 591, 592, 593).
Communication between the PMEs is controlled by in and out registers under control of the processing
element. This diagram shows the various paths that can be taken to escape I/O out any of the eight
directions, +-w 525, 565, +-x 515, 555, +-y 505, 545, +-z 535, 575. The communication can be
accomplished without storing the data into memory if desired.

It may be noted that while a network switch chip could be employed to connect various cards, each
having our chip, with other chips of the system, it can and should desirably be eliminated. Our inter-PME
network that we describe as the "4d torus" is the mechanism used for inter-PME communication. A PME
can reach any other PME in the array on this interface. (PMEs in between may be Store/Forward or Circuit
Switched.)

Chip Relationships for Interconnections

We have discussed the chip, and FIGURE 11 shows a block diagram of the PME Processor/Memory
chip. The chip consists of the following elements, each of which will be described in later paragraphs:

1. 8 PMEs, each consisting of a 16 bit programmable processor and 32K words of memory (64K bytes).

2. Broadcast Interface (BCI) to permit a controller to operate all or subsets of the PMEs and to
accumulate PME requests.

3. Interconnection Levels

a. Each PME supports four 8 bit wide inter-PME communication paths. These connect to 3
neighboring PMEs on the chip and 1 off-chip PME.

b. Broadcast-to-PME busing, which makes data or instructions available.

c. Service Request lines that permit any PME to send a code to the controller. The BCI combines the
requests and forwards a summary.

d. Serial Service loops permit the controller to read all detail about the functional blocks. This level of
interconnection extends from the BCI to all PMEs. (FIGURE 11 for ease of presentation omits this
detail.)

Interconnection and Routing

The MPP will be implemented by replication of the PME. The degree of replication does not affect the
interconnection and routing schemes used. FIGURE 6 provides an overview of the network interconnection
scheme. The chip contains 8 PMEs with interconnections to their immediate neighbors. This interconnection
pattern results in the three dimensional cube structure shown in FIGURE 10. Each of the processors within
the cube has a dedicated bidirectional byte port to the chip's pins; we refer to the set of 8 PMEs as a node.

An n by n array of nodes is a cluster. Simple bridging between the + and - x ports and the + and - y
ports provides the cluster node interconnections. Here our preferred chip or node has 8 PMEs, each of
which manages a single external port. This distributes the network control function and eliminates a possible
bottleneck for ports. Bridging the outer edges makes the cluster into a logical torus. We have considered
clusters with n = 4 and n = 8 and believe that n = 8 is the better solution for commercial applications while
n = 4 is better for military conduction cooled applications. Our concept does not impose an unchangeable
cluster size. On the contrary, we anticipate some applications using variations.

An array of clusters results in the 4 dimensional torus or hypercube structure illustrated in FIGURE 10.
Bridging between the + and - w ports and + and - z ports provides the 4d torus interconnections. This
results in each node within a cluster being connected to an equivalent node in all adjacent clusters. (This
provides 64 ports between two adjacent clusters rather than the 8 ports that would result from larger
clusters.) As with the cluster size, the scheme does not imply a particular size array. We have considered
2x1 arrays desirable for workstations and MIL applications and 4x4, 4x8 and 8x8 arrays for mainframe
applications.

Developing an array of 4d toruses is beyond the gate, pin, and connector limitations of our current
preferred chip. However, that limitation disappears when our alternative on-chip optical driver/receiver is
employed. In this embodiment our network could use an optical path per PME; with 12 rather than 8 PMEs
per chip, the array of 4d toruses offers multi-Tflop (Teraflop) performance, and it seems to be economically
feasible to make such machines available for the workstation environment. Remember that such alternative
machines will use the application programs developed for our current preferred embodiment.

4d Cluster Organization

For constructing a 4d modified hypercube 360, as illustrated in FIGURES 6 and 10, nodes supporting 8
external ports 315 are required. Consider the external ports to be labeled as +X, +Y, +Z, +W, -X, -Y, -Z,
-W. Then using m1 nodes, a ring can be constructed by connecting the +X to -X ports. Again m2 such
rings can be interconnected into a ring of rings by interconnecting the matching +Y to -Y ports. This level
of structure will be called a cluster 320. With m1 = m2 = 8 it provides for 512 PMEs, and such a cluster will
be a building block for several size systems (330, 340, 350) as illustrated with m = 8 in FIGURE 6.

4d Array Organization 

For building large fine-grained systems, sets of m3 clusters are interconnected in rows using the +Z to
-Z ports. The m4 rows are then interconnected using the +W to -W ports. For m1 = ... = m4 = 8 this results
in a system with 32768 or 8^5 PMEs. The organization does not require that every dimension be equally
populated, as shown in FIGURE 6 (large fine-grained parallel processor 370). In the case of the fine-grained
small processor, only a cluster might be used, with the unused Z and W ports being interconnected on the
card. This technique saves card connector pins and makes possible the application of this scalable
processor to workstations 340, 350 and avionics applications 330, both of which are connector pin limited.
Connecting the ports together in the Z and W pairs leads to a workstation organization that permits debug,
test and large machine software development.

Again, much smaller scale versions of the structure can be developed by generating the structure with a
value smaller than m = 8. This will permit construction of single card processors compatible with the
requirements for accelerators in the PS/2 or RISC System 6000 workstation 350.
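
The ring-of-rings wiring and the PME counts described above can be checked with a short sketch. The
helper names (pme_count, port_neighbor) and the representation of a node as an (x, y, z, w) tuple are
assumptions made for illustration only; the specification itself describes wiring, not software.

```python
from typing import Tuple

PMES_PER_NODE = 8          # one node (chip) holds 8 PMEs
AXES = "XYZW"              # the four external port pairs of a node

Node = Tuple[int, int, int, int]


def pme_count(m1: int, m2: int, m3: int = 1, m4: int = 1) -> int:
    """PMEs in an m1 x m2 cluster replicated m3 x m4 times."""
    return PMES_PER_NODE * m1 * m2 * m3 * m4


def port_neighbor(node: Node, port: str, extents: Node) -> Node:
    """Node reached through a port such as '+X' or '-W', with ring wraparound."""
    sign, axis = port[0], port[1]
    i = AXES.index(axis)
    step = 1 if sign == "+" else -1
    out = list(node)
    out[i] = (out[i] + step) % extents[i]
    return (out[0], out[1], out[2], out[3])
```

A cluster-only machine corresponds to extents (8, 8, 1, 1): the +Z and -Z ports of a node then lead back
to the same node, which models the unused Z and W ports being tied together on the card.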

I/O Performance 



I/O performance includes overhead to set up transfers and actual burst rate data movement. Setup
overhead depends upon application function I/O complexity and network contention. For example, an
application can program circuit switched traffic with buffering to resolve conflicts, or it can have all PMEs
transmit left and synchronize. In the first case, I/O is a major task and detailed analysis would be used to
size it. We estimate that simple case setup overhead is 20 to 30 clock cycles, or 0.8 to 1.2 u-sec.

Burst rate I/O is the maximum rate a PME can transfer data (with either an on or off chip neighbor).
Memory access limits set the data rate at 140 nsec per byte, corresponding to 7.14 Mbyte/s. This
performance includes buffer address and count processing plus data read/write. It uses seven 40 ns cycles
per 16 bit word transferred.

This burst rate performance corresponds to a cluster having a maximum potential transfer rate of 3.6
Gbytes/s. It also means that a set of eight nodes along a row or column of the cluster will achieve 57
Mbyte/s burst data rate using one set of their 8 available ports. This number is significant because I/O with
the external world will be done by logically 'unzipping' an edge of the wrapped cluster and attaching it to
the external system bus.
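
The figures in this section follow from a few multiplications, reproduced in the sketch below. The
constant names are illustrative; the 40 ns cycle, seven cycles per 16 bit word, 512 PMEs per cluster
and eight nodes per edge are taken from the text, and one Mbyte is taken as 10^6 bytes.

```python
CYCLE_NS = 40            # PME clock cycle in nanoseconds
CYCLES_PER_WORD = 7      # cycles used per 16 bit word transferred
BYTES_PER_WORD = 2
PMES_PER_CLUSTER = 512
NODES_PER_EDGE = 8       # one port per node along a row or column

# 7 cycles * 40 ns = 280 ns per word, i.e. 140 ns per byte.
NS_PER_BYTE = CYCLE_NS * CYCLES_PER_WORD / BYTES_PER_WORD

# One byte every NS_PER_BYTE nanoseconds, expressed in Mbyte/s.
PME_BURST_MBPS = 1000.0 / NS_PER_BYTE

# Every PME in a cluster transferring at once, in Gbyte/s.
CLUSTER_PEAK_GBPS = PMES_PER_CLUSTER * PME_BURST_MBPS / 1000.0

# Eight nodes along one edge, each driving one port, in Mbyte/s.
EDGE_BURST_MBPS = NODES_PER_EDGE * PME_BURST_MBPS

# Simple-case setup overhead of 20 to 30 cycles, in microseconds.
SETUP_US = (20 * CYCLE_NS / 1000.0, 30 * CYCLE_NS / 1000.0)
```

The edge figure of roughly 57 Mbyte/s is the one that matters for external I/O, since only the
unzipped edge of the cluster is attached to the system bus.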

Inter-PME Routing Protocol

The SIMD/MIMD PME comprises interprocessor communication to the external I/O facilities, broadcast
control interfaces, and switching features which allow both SIMD and MIMD operation within the same PME.
Embedded in the PME is the fully distributed programmable I/O router for processor communication and
data transfers between PMEs.

The PMEs have fully distributed interprocessor communication hardware to on-chip PMEs as well as to
the external I/O facilities which connect to the interconnected PMEs in the modified hypercube
configuration. This hardware is complemented with the flexible programmability of the PME to control the
I/O activity via software. The programmable I/O router functions provide for generating data packets and
packet addresses. With this information the PME can send the information through the network of PMEs in
a directed method or out multiple paths determined by any fault tolerance requirements.

Distributed fault tolerance algorithms or program algorithms can take advantage of the programmability
along with the supported circuit switched modes of the PME. This combination of modes enables
everything from off-line PMEs to optimal path data structures to be accomplished via the
programmable I/O router.

Our study of applications reveals that it is sometimes most efficient to send bare data between PMEs.
At other times applications require data and routing information. Further, it is sometimes possible to plan
communications so that network conflicts cannot occur; other applications offer the potential for deadlock,
unless mechanisms for buffering messages at intermediate nodes are provided. Two examples illustrate the
extremes. In the relaxation phase of a PDE solution, each grid point can be allocated to a node. The inner
loop process of acquiring data from a neighbor can easily be synchronized over all nodes. Alternatively,
image transformations use local data parameters to determine communication target or source identifiers.
This results in data moves through multiple PMEs, and each PME becomes involved in the routing task for
each packet. Preplanning such traffic is generally not possible.

To enable the network to be efficient for all types of transfer requirements, we partition between the
H/W and S/W the responsibility for data routing between PMEs. S/W does most of the task sequencing
function. We added special features to the hardware (H/W) to do the inner loop transfers and minimize
software (S/W) overhead on the outer loops.

I/O programs at dedicated interrupt levels manage the network. For most applications, a PME dedicates
four interrupt levels to receiving data from the four neighbors. We open a buffer at each level by loading
registers at the level, and executing the IN (it uses buffer address and transfer count but does not await
data receipt) and RETURN instruction pair. Hardware then accepts words from the particular input bus and
stores them to the buffer. The buffer full condition will then generate the interrupt and restore the program
counter to the instruction after the RETURN. This approach to interrupt levels permits I/O programs to be
written that do not need to test what caused the interrupt. Programs read data, return, and then continue
directly into processing the data they read. Transfer overhead is minimized as most situations require little
or no register saving. Where an application uses synchronization on I/O, as in the PDE example, then
programs can be used to provide that capability.

Write operations can be started in several ways. For the PDE example, at the point where a result is to
be sent to a neighbor, the application level program executes a write call. The call provides buffer location,
word count and target address. The write subroutine includes the register loads and OUT instructions
needed to initiate the H/W and return to the application. H/W does the actual byte by byte data transfer.
More complicated output requirements will use an output service function at the highest interrupt level. Both
application and interrupt level tasks access that service via a soft interrupt.

Setting up circuit switched paths builds on these simple read and write operations. We start with all
PMEs having open buffers sized to accept packet headers but not the data. A PME needing to send data
initiates a transfer by sending an address/data block to a neighboring PME whose address better matches
the target. In the neighboring PME address information will be stored; due to buffer full an interrupt will
occur. The interrupt S/W tests the target address and will either extend the buffer to accept the data or write
the target address to an output port and set the CTL register for transparent data movement. (This allows
the PME to overlap its application executions with the circuit switched bridging operation.) The CTL register
goes to busy state and remains transparent until reset by the presence of a signal at end of data stream or
abnormally by PME programming. Any number of variations on these themes can be implemented.
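
The header-forwarding step above can be illustrated with a small sketch. It assumes, purely for
illustration, that PMEs are identified by coordinate tuples on a torus and that "a neighbor whose address
better matches the target" means correcting the first mismatching coordinate by one step along the
shorter way around the ring. The specification leaves the choice of neighbor to software, so next_hop
and set_up_circuit are hypothetical names for one possible policy.

```python
from typing import List, Optional, Sequence, Tuple

Coord = Tuple[int, ...]


def next_hop(current: Coord, target: Coord, extents: Sequence[int]) -> Optional[Coord]:
    """Neighbor that better matches the target, or None when already there."""
    for axis, (c, t, size) in enumerate(zip(current, target, extents)):
        if c != t:
            forward = (t - c) % size
            # Take the shorter way around the ring in this dimension.
            step = 1 if forward <= size - forward else -1
            nxt = list(current)
            nxt[axis] = (c + step) % size
            return tuple(nxt)
    return None


def set_up_circuit(source: Coord, target: Coord, extents: Sequence[int]) -> List[Coord]:
    """Return the intermediate PMEs that would set CTL to transparent.

    Each intermediate PME receives the header, sees that it is not the
    target, forwards the target address and becomes transparent. The
    target PME instead extends its buffer to accept the data.
    """
    transparent: List[Coord] = []
    node = source
    while True:
        nxt = next_hop(node, target, extents)
        if nxt is None:
            break
        if nxt != target:
            transparent.append(nxt)
        node = nxt
    return transparent
```

For a path from (0, 0) to (2, 7) on an 8 by 8 torus the sketch marks (1, 0) and (2, 0) as transparent
and reaches the target through the wraparound link in the second dimension.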
System I/O and Array Director

FIGURE 12 shows an Array Director in the preferred embodiment, which may perform the functions of
the controller of FIGURE 13, which describes the system bus to array connections. FIGURE 13 is composed
of two parts: (a) the bus to/from a cluster, and (b) the communication of information on the bus to/from
a PME. Loading or unloading the array is done by connecting the edges of clusters to the system bus.
Multiple system buses can be supported with multiple clusters. Each cluster supports 50 to 57 Mbyte/s
bandwidth. Loading or unloading the parallel array requires moving data between all or a subset of the
PMEs and standard buses (i.e. MicroChannel, VME-bus, FutureBus, etc). Those buses, part of the host
processor or array controller, are assumed to be rigidly specified. The PME Array therefore must be
adapted to the buses. The PME Array can be matched to the bandwidth of any bus by interleaving bus data
onto n PMEs, with n picked to permit PMEs both I/O and processing time. FIGURE 13 shows how we might
connect the system buses to the PMEs at two edges of a cluster. Such an approach would permit 114
Mbyte/s to be supported. It also permits data to be loaded at half the peak rate to two edges
simultaneously. Although this reduces the bandwidth to 57 Mbyte/s/cluster, it has the advantage of providing
orthogonal data movement within the array and the ability to pass data between two buses. (We use those
advantages to provide fast transpose and matrix multiply operation.)

As shown in part (a) of FIGURE 13, the bus "dots" to all paths on the edges of the cluster, and the
controller generates a gate signal to each path in the required interleave timing. If required to connect to a
system bus greater than 57 Mbyte/s, the data will be interleaved over multiple clusters. For example, in
a system requiring 200 Mbyte/s system buses, groups of 2 or 4 clusters will be used. A large MPP has the
capacity to attach 16 or 64 such buses to its xy network paths. By using the w and z paths in addition to the
x and y paths, that number could be doubled.
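
The cluster grouping for a given bus speed is a simple ceiling division, sketched below. The function
name clusters_per_bus is hypothetical, and treating the "2 or 4 clusters" of the 200 Mbyte/s example as
the two-edge (114 Mbyte/s) and one-edge (57 Mbyte/s) cases respectively is an interpretation rather than
something the text states outright.

```python
import math

EDGE_MBPS = 57  # burst bandwidth of one cluster edge, in Mbyte/s


def clusters_per_bus(bus_mbps: float, edges_per_cluster: int = 1) -> int:
    """Clusters that must be interleaved to keep up with a system bus."""
    per_cluster = EDGE_MBPS * edges_per_cluster
    return math.ceil(bus_mbps / per_cluster)
```

Under these assumptions a 200 Mbyte/s bus needs four clusters when one edge of each is attached and
two clusters when two edges of each are attached.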

FIGURE 13 part (b) shows how the data routes to individual PMEs. The FIGURE shows one particular
w, x, y or z path that can be operated at 7.14 Mbyte/s in burst mode. If the data on the system bus occurred
in bursts, and if the PME memory could contain the complete burst, then only one PME would be required.
We designed the PME I/O structure to require neither of these conditions. Data can be gated into
PMEx0 at the full rate until buffer full occurs. At that instant PMEx0 will change to transparent and PMEx1
will begin accepting the data. Within PMEx0, processing of the input data buffer can begin. PMEs that have
taken data and processed it are limited because they cannot transmit the results while in the transparent
mode. The design resolves this by switching the data stream to the opposite end of the path at intervals.
FIGURE 13(b) shows that under S/W control one might dedicate PMEx0 through PMEx3 to accepting data
while PMEx12 through PMEx15 unload results, and vice versa. The controller counts words and adds end of
block signals to the data stream, causing the switch in direction. One count applies to all paths supported
by the controller, so controller workload is reasonable.
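
The way a burst spills from one PME to the next along a path can be modelled in a few lines. This is a
behavioral sketch only: distribute_burst is a hypothetical name, buffers are modelled as Python lists,
and the switch of direction at end of block is not modelled.

```python
from typing import Iterable, List


def distribute_burst(words: Iterable[int], buffer_words: int, pme_count: int) -> List[List[int]]:
    """Fill PMEx0 until its buffer is full, then let it go transparent
    so that PMEx1 starts accepting data, and so on along the path."""
    buffers: List[List[int]] = [[] for _ in range(pme_count)]
    active = 0  # index of the PME currently accepting data
    for word in words:
        if active == pme_count:
            raise OverflowError("burst exceeds the buffering along this path")
        buffers[active].append(word)
        if len(buffers[active]) == buffer_words:
            active += 1  # this PME becomes transparent; the next one accepts
    return buffers
```

A ten word burst over four PMEs with four word buffers leaves the first two PMEs full, the third
holding two words and the fourth untouched.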

SYSTEMS FOR ALTERNATIVE COMPUTERS 

FIGURE 18 illustrates a system block diagram for a host attached large system with a single application
processor interface (API). This illustration may also be viewed with the understanding that our invention may
be employed in stand alone systems which use multiple application processor interfaces (not shown). This
configuration will support DASD/graphics on all or many clusters. Workstation accelerators can eliminate the
host, application processor interface (API) and cluster synchronizer (CS) illustrated, by emulation. The CS is
not always required. It will depend on the type of processing that is being performed, as well as the
physical drive or power provided for a particular application which uses our invention. An application that is
doing mostly MIMD processing will not place as high a workload demand on the controller, so here the
control bus can see very slow pulse rise times. Conversely, a system doing mostly asynchronous A-SIMD
operations with many independent groupings may require faster control busing. In this case, a cluster
synchronizer will be desirable.

The system block diagram of FIGURE 18 illustrates that a system might consist of host, array controller
and PME array. The PME array is a set of clusters supported by a set of cluster controllers (CC). Although
a CC is shown for each cluster, that relationship is not strictly required. The actual ratio of clusters to CCs is
flexible. The CC provides redrive to, and accumulation from, the 64 BCIs/cluster. Therefore, physical
parameters can be used to establish the maximum ratio. Additionally, the CC will provide for controlling
[illegible in source] requirements of a particular application of our invention. Two versions of the CC can
be used. A cluster that is to be connected to a system bus requires the CC providing
interleave controls (see System I/O and FIGURE 18) and tri-state drivers. A more simple version that omits
the tri-state busing features can also be employed. In the case of large systems, a second stage of redrive
and accumulation is added. This level is the cluster synchronizer (CS). The set of CCs plus CS and the
Application Processor Interface (API) make up the Array Controller. Only the API is a programmable unit.

Several variations of this system synthesis scheme will be used. These result in different hardware
configurations for various applications, but they do not have a major impact on the supporting software.

For a workstation accelerator, the cluster controllers will be attached directly to the workstation system
bus; the function of the API will be performed by the workstation. In the case of a RISC/6000, the system
bus is a Micro Channel and the CC units can plug directly into the slots within the workstation. This
configuration places the I/O devices (DASD, SCSI and display interfaces) on the same bus that
loads/unloads the array. As such the parallel array can be used for I/O intensive tasks like real time image
generation or processing. For workstations using other bus systems (VME-bus, FutureBus, etc.), a gateway
interface will be used. Such modules are readily available in the commercial marketplace. Note that in these
minimal scale systems a single CC can be shared between a determined number of clusters, and neither a
CS nor an API is needed.

A MIL avionics application might be similar in size to a workstation, but it needs different interfacing.
Consider what may become the normal military situation. An existing platform must be enhanced with
additional processing capability, but funding prohibits a complete processing system redesign. For this we
would attach to the APAP array a smart memory coprocessor. In this case, a special application program
interface (API) that appears to the host as memory will be provided. Data addressed to the host's memory
will then be moved to the array via the CC(s). Subsequent writes to memory can be detected and interpreted
as commands by the API so that the accelerator appears to be a memory mapped coprocessor.

Large systems can be developed as either host attached or as stand alone configurations. For a host
attached system, the configuration shown in FIGURE 18 is useful. The host will be responsible for I/O, and
the API would serve as a dispatched task manager. However, a large stand alone system is also possible in
special situations. For example, a database search system might eliminate the host, attach DASD to the
MicroChannels of every cluster and use multiple APIs as bus masters slaved to the PMEs.

Zipper Array Interface with External I/O

Our zipper provides a fast I/O connection scheme and is accomplished by placing a switch between two
nodes of the array. This switch will allow for parallel communication into and out of the array. The fast
I/O will be implemented along one edge of the array rings and acts like a large zipper into the X, Y, W, Z
rings. The name "zipper connection" is given to the fast I/O. Allowing data to be transferred into and out of
the network while only adding switch delays to transfer the data between processors is a unique loading
technique. The switching scheme does not disrupt the ring topology created by the X, Y, W, Z buses, and
special support hardware allows the zipper operation to occur while the PE is processing or routing data.

The ability to bring data into and out of a massively parallel system rapidly is an important
enhancement to the performance of the overall system. We believe that the way we implement our fast I/O
without reducing the number of processors or dimension of the array network is unique in the field of
massively parallel environments.

The modified hypercube arrangement can be extended to permit a topology which comprises rings
within rings. To support the interface to the external I/O, any or all of the rings can be logically broken. The
two ends of the broken ring can then be connected to external I/O buses. Breaking the rings is a logical
operation so as to permit regular inter-PME communication at certain time intervals while permitting I/O at
other time intervals. This process of breaking a level of rings within the modified hypercube effectively
'unzips' rings for I/O purposes. The fast I/O "zipper" provides a separate interface into the array. This zipper
may exist on 1 to n edges of the modified hypercube and could support either parallel input into multiple
dimensions of the array or broadcast to multiple dimensions of the array. Further, data transfers into or out
of the array could alternate between the two nodes directly attached to the zipper. This I/O approach is
unique and it permits developing different zipper sizes to satisfy particular application requirements. For
example, in the particular configuration shown in FIGURE 6, called the large fine-grained processor 35[?], the
zipper for the Z and W buses will be dotted onto the MCA bus. This approach optimizes the matrix
transposition time, satisfying a particular application requirement for the processor. For a more detailed
explanation of the "zipper" structure, reference may be had to the APAP I/O ZIPPER application filed
concurrently herewith. The zipper is here illustrated by FIGURE 14.

Depending on the configuration and the need of the program to roll data and program into and out of
the individual processing elements, the size of the zipper can be varied. The actual speed of the I/O zipper
is approximately the number of rings attached, times the PME bus width, times the PME clock rate, all
divided by 2. (The division permits the receiving PME time to move data onward. Since it can send it to any
[illegible in source] over the array.) With existing technology, i.e. 5 MB/sec
PME transfer rate, 64 rings on the zipper, and interleaving to two nodes, 320 MB/sec array transfer
rates are possible. (See the typical zipper configuration in FIGURE 14.) FIGURE 14 illustrates the fast I/O or
the so-called "zipper connection" 700, 710 which exists as a separate interface into the array. This zipper
[illegible in source] multiple nodes in the array 751, 752, 753, 754 and in multiple directions 770, 780.

The MCA bus supports 80 to 160 MB per second burst transfer rate and therefore is a good match
[illegible in source] efficiency is somewhat less than that. For systems that have even more demanding I/O
requirements, multiple zippers and MCA buses can be utilized. These techniques are seen to be important to
processors that would support a large external storage associated with nodes or clusters, as might be
characteristic of database machines. Such I/O growth capability is completely unique to this machine and has
not previously been incorporated in either massively parallel, conventional single processor, or
coarse-grained parallel machines.

Array Director Architecture

Our massively parallel system is made up of nodal building blocks of multiprocessor nodes, clusters of
nodes, and arrays of PMEs already packaged in clusters. For control of [illegible in source] the Array
Director [illegible in source] performs the overall processing [illegible in source] in the massively parallel
processing environment.

The Array Director comprises three functional areas: the Application Interface, the Cluster Synchronizer and
a Cluster Controller. The Array Director will have the overall control of the PME array, using the
broadcast bus and our zipper connection to steer data and commands to all of the PMEs. The Array
Director functions as a software system interacting with the hardware to perform the role of the shell of the
operating system. The Array Director in performing this role receives commands from the application
interface and issues the appropriate instructions and hardware sequences to accomplish the
designated task. The Array Director's main function is to continuously feed the instructions to the PMEs and
route data in optimal sequences to keep the traffic at a maximum and collisions to a minimum.

The Array Director of the system shown in FIGURE 18 is illustrated in more detail in the diagram of FIGURE
19, which illustrates the Array Director which can function as a controller, or array controller, as illustrated in
FIGURE 13 and FIGURES 18 and 19. This Array Director 610 illustrated in FIGURE 19 is shown in the
preferred embodiment of an APAP in a typical configuration of n identical array clusters 665, 670, 680, 690
with an array director 610 for the clusters of 512 PMEs, and an application processor interface 630 for
[illegible in source] 640 and together they make up the "Array Director" 610. The application processor
[illegible in source]. For stand alone machines, the Array
Director becomes the host unit and accordingly becomes involved in unit I/O activities.

The Array Director consists of the following four functional areas (see the functional block diagram in
FIGURE 19):

1. Application Processor Interface (API) 630,

2. Cluster Synchronizer (CS) 650 (8 x 8 array of clusters),

3. Cluster Controller (CC) 640 (8 x 1 array of nodes),

4. Fast I/O (zipper connection) 620.

The Application Processor Interface (API) 630:

When operating in attached modes, one API will be used for each host. That API will monitor the
incoming data stream to determine what are instructions to the Array clusters 665, 670, 680, 690 and what
are data for the Fast I/O (zipper) 620. [illegible in source] the API [illegible in source]
processors within the Array Director, [illegible in source] for API [illegible in source] commands.
Instructions received from the host can call
for execution of API subroutines, loading of API memory with additional functions, or for loading of CC and
PME memory with new S/W. As described in the S/W overview section, these various types of requests can
be restricted to a subset of users via the initial programs loaded into the API. Thus, the operating program
loaded will determine the type of support provided, which can be tailored to match the performance
capability of the API. This further permits the APAP to be adjusted to the needs of multiple users requiring
managed and well tested services, or to the individual user wishing to obtain peak performance on a
particular application.

The API also provides for managing the path to and from the I/O zipper. Data received from the host
system in attached modes, or from devices in standalone modes, is forwarded to the Array. Prior to initiating
this type of operation the PMEs within the Array which will be managing the I/O are initiated. PMEs
operating in MIMD mode can utilize the fast interrupt capability and either standard S/W or special functions
for this transfer, while those operating in SIMD modes would have to be provided detailed control
instructions. Data being sent from the I/O zipper requires somewhat the opposite conditioning. PMEs
operating in MIMD modes must signal the API via the high speed serial interface and await a response from
the API, while PMEs in SIMD modes are already in synchronization with the API and can therefore
immediately output data. The ability to system switch between modes provides a unique ability to adjust the
program to the application.

Cluster Synchronizer (CS) 650

The CS 650 provides the bridge between the API 630 and CC 640. It stores API 630 output in FIFO
stacks and monitors the status being returned from the CC 640 (both parallel input acknowledges and high
speed serial bus data) to provide the CC, in timely fashion, with the desired routines or operations that need
to be started. The CS provides the capability to support different CCs and different PMEs within clusters so
as to permit dividing the array into subsets. This is done by partitioning the array and then commanding the
involved cluster controllers to selectively forward the desired operation. The primary function of the
synchronizer is to keep all clusters operating and organized such that overhead time is minimized or buried
under the covers of PME execution time. We have described how the use of the cluster synchronizer in
A-SIMD configurations is especially desirable.

Cluster Controller (CC) 640

The CC 640 interfaces to the node Broadcast and Control Interface (BCI) 605 for the set of nodes in an
array cluster 665. (For a 4d modified hypercube with 8 nodes per ring that means the CC 640 is attached to
64 BCIs 605 in an 8 by 8 array of nodes and is controlling 512 PMEs. Sixty-four such clusters, also in an 8
by 8 array, lead to the full up system with 32768 PMEs.) The CC 640 will send commands and data
supplied by the CS 650 to the BCI parallel port and return the acknowledgement data to the CS 650 when
operating in MIMD modes. In SIMD mode the interface operates synchronously, and step by step
acknowledgments are not required. The CC 640 also manages and monitors the high speed serial port to
determine when PMEs within the nodes are requesting services. Such requests are passed upward to the
CS 650 while the raw data from the high speed serial interface is made available to the status display
interface. The CC 640 provides the CS 650 with an interface to specific nodes within the cluster via the
standard speed serial interface.
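
The PME counts quoted for a cluster and for the full up system follow from the 8 by 8 arrangement at each level. A minimal arithmetic check, in which the constant names are illustrative and not taken from the specification:

```python
# Sizes stated in the text; the constant names are illustrative only.
PMES_PER_NODE = 8                                   # eight PMEs on one chip (node)
NODES_PER_RING = 8                                  # 8 nodes per ring
NODES_PER_CLUSTER = NODES_PER_RING * NODES_PER_RING  # 8 by 8 array of nodes
CLUSTERS_PER_SYSTEM = 8 * 8                         # 8 by 8 array of clusters

pmes_per_cluster = PMES_PER_NODE * NODES_PER_CLUSTER
nodes_per_system = NODES_PER_CLUSTER * CLUSTERS_PER_SYSTEM
pmes_per_system = pmes_per_cluster * CLUSTERS_PER_SYSTEM

print(NODES_PER_CLUSTER, pmes_per_cluster, nodes_per_system, pmes_per_system)
```

One cluster controller therefore serves 64 node BCIs and 512 PMEs, and 64 clusters give 4096 nodes and 32768 (2^15) PMEs.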

In SIMD mode the CC will be directed to send instructions or addresses to all the PMEs over the
broadcast bus. The CC can dispatch a 16 bit instruction to all PMEs every 40 nanoseconds when in SIMD
mode. By broadcasting groups of native instructions to the PME, the emulated instruction set is formed.

When in MIMD mode the CC will wait for the endop signal before issuing new instructions to the PMEs.
The concept of the MIMD mode is to build strings of micro-routines with native instructions resident in the
PME. These strings can be grouped together to form the emulated instructions, and these emulated
instructions can be combined to produce service/canned routines or library functions.

When in SIMD/MIMD (SIMIMD) mode, the CC will issue instructions as if in SIMD mode and check for
endop signals from certain PMEs. The PMEs that are in MIMD will not respond to the broadcast instructions
and will continue with their designated operation. The unique status indicators will help the CC to manage
this operation and determine when and to whom to present the sequential instructions.
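
The SIMIMD behavior can be pictured with a small simulation. This is a sketch under stated assumptions, not the controller design: the class names, the `tick` method, and the rule that a PME rejoins the broadcast stream as soon as its endop is seen are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PME:
    ident: int
    mode: str = "SIMD"          # "SIMD" or "MIMD"
    remaining_steps: int = 0    # steps left in a local MIMD routine (assumed model)
    endop: bool = False         # raised when the local routine completes
    executed: list = field(default_factory=list)

    def start_mimd(self, steps: int) -> None:
        """Switch this PME to a local routine of `steps` steps."""
        self.mode, self.remaining_steps, self.endop = "MIMD", steps, False

    def tick(self, broadcast_instr: str) -> None:
        if self.mode == "SIMD":
            # SIMD PMEs execute whatever the controller broadcasts.
            self.executed.append(broadcast_instr)
        else:
            # MIMD PMEs ignore the broadcast and continue their own work.
            self.executed.append("local")
            self.remaining_steps -= 1
            if self.remaining_steps == 0:
                self.endop = True


class ClusterController:
    def __init__(self, pmes):
        self.pmes = pmes

    def broadcast(self, instr: str) -> None:
        """Issue one instruction as if in SIMD mode, then check endop signals."""
        for pme in self.pmes:
            pme.tick(instr)
        for pme in self.pmes:
            if pme.endop:
                # Assumption: a finished PME rejoins the broadcast stream.
                pme.mode, pme.endop = "SIMD", False


pmes = [PME(i) for i in range(3)]
pmes[1].start_mimd(2)
cc = ClusterController(pmes)
for instr in ["ADD", "SUB", "MUL"]:
    cc.broadcast(instr)
print([p.executed for p in pmes])
```

Here PME 1 skips the first two broadcast instructions while it finishes its own routine, then follows the broadcast stream again.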

Operational Software Levels

This application overviews the operational software (S/W) levels to provide further explanation of the
services performed by various hardware (H/W) components.






Computer systems generally used have an operating system. Operating system kernels which are
relatively complete must be provided in most massive MIMD machines, where workstation class CPU chips
run kernels such as [illegible]. The operating system kernel supports message passing or memory coherency.

Other massively parallel systems based upon SIMD models have almost no intelligence in the array. There
are no "program counters" out in the array, and thus no programs to execute locally. All instructions are
broadcast.

In the systems we have provided with our PME as the basis for cluster arrays, there is no need for an
operating system [illegible] a library of key functions for computation and/or [illegible] SIMD-like
instructions are [illegible] mode one or more of these library routines. In addition, basic interrupt handler
and communications routines are resident in each PME, allowing the PME to handle communication on a
dynamic basis. Unlike existing MIMD machines, the APAP structure need not include an entire program in
PME memory. [illegible] code, which is essentially serial, is in the cluster controller. Thus such code,
90% by space and 10% by time, [illegible] to an array of PMEs. Only the truly parallel inner loops are
distributed to the PMEs in a dynamic fashion. These are then initiated into MIMD mode just as other
library routines are. This enables use of program models which are Single Program Multiple Data to be
[illegible] the same program is loaded in each PME node, with embedded synchronization code, and
executed at the local PME. Design parameters affect bandwidth available on different links [illegible]
chip communicate directly with each other. [illegible] The system can [illegible] a range of
interconnection topologies, with routing performed dynamically and programmably.

[illegible] to create an executable parallel program [illegible] code for a [illegible] PME [illegible]
Multiple Data system would pass through program analysis, for examination of dependency, data and
[illegible]. Thereafter, program transformation would occur whereby a modified version of the program would be
created that [illegible] the degree of parallelism by combining sequences or [illegible]
generate explicit compiler directives. A next step would be a data allocation and partitioning step, with
[illegible] elements to be combined [illegible] pattern, and these would provide embedded program compiler
directives and calls to communication services. At this point the program would pass to a level [illegible]
CONTROLLER (array director or cluster controller), and HOST. Array portions would be interleaved in
[illegible] any required message passing synchronization [illegible] pass to a [illegible] (FORTRAN [illegible])
[illegible] pass to a parser (FORTRAN [illegible]) to an intermediate [illegible] code. PME code
[illegible] level [illegible] library extensions, which would pass on load [illegible].

During execution, a PME parallel program in the [illegible] process of execution could call
upon already coded assembly service functions from a runtime library kernel.

[illegible] function as [illegible] unit [illegible] or loosely coupled with its host or
[illegible] processor, variations in the upper level S/W [illegible] exist. However, these variations
[illegible] various [illegible] so as to permit a single set of [illegible] functions to satisfy
[illegible] within the [illegible] or the Array Director is a user option. Host level interpretation permits the
[illegible] but limits the APAP [illegible] to perform
multi-user management tasks. This leads to the concept that the APAP and host can be tightly or loosely
coupled.

Two examples illustrate the extremes:

1. When APAP is attached to 3090 class machines with Floating Point Vector Facilities, user data in
compressed form could be stored within the APAP. A host program that called for a vector operation
upon two vectors with differing sparseness characteristics would then send instructions to the APAP to
realign the data into element by element matching pairs, output the result to the Vector Facility, read the
answer from the Vector Facility and finally reconfigure data into final sparse data form. Segments of the
APAP would be interpreting and building sparse matrix bit maps, while other sections would be
calculating how to move data between PMEs such that it would be properly aligned for the zipper.

2. With APAP attached to a small inflight military computer, the APAP could be performing the entire
workload associated with Sensor Fusion Processing. The host might initiate the process once, send
sensor data as it was received to the APAP and then wait for results. The Array Director would then have
to schedule and sequence the PME array through perhaps dozens of processing steps required to
perform the process.

The APAP will support three levels of user control:

1. Casual User. S/he works with supplied routines and library functions. These routines are maintained at
the host or API level and can be invoked by the user via subroutine calls within his program.

2. Customizer User. S/he can write special functions which operate within the API and which directly
invoke routines supplied with the API or services supplied with the CC or PME.

3. Development User. S/he generates programs for execution in the CC or PME, depending upon API
services for program load and status feedback.

Satisfying these three user levels in either closely or loosely coupled systems leads to the partitioning
of H/W control tasks.

API Software Tasks

The application program interface API contains S/W services that can test the leading words of data
received and can determine whether that data should be interpreted by the API, loaded to some storage
within the Array Director or PME, or passed to the I/O zipper.
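
The three-way decision made on the leading words can be sketched as a small classifier. The header codes below are hypothetical; the specification does not give the word format, so `HEADER_CODES` and the function name are assumptions for illustration only.

```python
from enum import Enum


class Disposition(Enum):
    INTERPRET = "interpreted by the API"
    LOAD = "loaded to Array Director or PME storage"
    ZIPPER = "passed to the I/O zipper"


# Hypothetical encoding of the leading word; the real format is not stated.
HEADER_CODES = {
    0x1: Disposition.INTERPRET,
    0x2: Disposition.LOAD,
    0x3: Disposition.ZIPPER,
}


def classify(words):
    """Test the leading word of a received block and pick its disposition."""
    if not words:
        raise ValueError("empty transfer")
    try:
        return HEADER_CODES[words[0]]
    except KeyError:
        raise ValueError(f"unknown leading word {words[0]:#x}") from None


print(classify([0x2, 0xBEEF, 0x0010]).value)
```

A table lookup keeps the dispatch separate from the handlers, so further dispositions could be added without touching the test of the leading word.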

For data that is to be interpreted, the API determines the required operation and invokes the function.
The most common type of operation would call for the Array to perform some function which would be
executed as a result of API writes to the CS (and indirectly to the CC). The actual data written to the CS/CC
would in general be constructed by the API operational routine based upon the parameters passed to the
API from the host. Data sent to the CS/CC would in turn be forwarded to the PMEs via the node BCI.

Data could be loaded to either API storage, CC storage, or PME memory. Further, data to be loaded to
PME memory could be loaded via either the I/O zipper or via the node BCI. For data to be put into the API
memory, the incoming bus would be read then written to storage. Data targeted to the CC memory would
be similarly read and then be written to the CC memory. Finally, data for the PME memory (in this case
normally new or additional MIMD programs) could be sent to all or selected PMEs via the CS/CC/Node BCI
or to a subset of PMEs for selective redistribution via the I/O zipper.

When data is to be sent to the I/O zipper, it could be preceded by inline commands that permit the
PME MIMD programs to determine its ultimate target; or, it could be preceded by calls to the API service
functions to perform either MIMD initiation or SIMD transmission.

In addition to responding to requests for service received via the host interface, the API program will
respond to requests from the PMEs. Such requests will be generated on the high speed serial port and will
be routed through the CC/CS combination. Requests of this sort can result in the API program's directly
servicing the PMEs or accessing the PMEs via the standard speed serial port to determine further qualifying
data relative to the service request.

PME Software

The software plan includes:

o Generation of PME resident service routines (that is, 'an extended ISA') for complex operations and
I/O management.

o Definition and development of controller executed subroutines that produce and pass control and
parameter data to the PMEs via the BCI bus. These subroutines:
1. cause a set of PMEs to do mathematical operations on distributed objects,
2. provide I/O data management and synchronization services for PME Array and System Bus
interactions,
3. provide startup program load, program overlay and program task management for PMEs.

o Development of data allocation support services for host level programs, and

o [illegible] a programming support system including assembler, simulator, and H/W monitor and [illegible]

Based upon studies of military sensor fusion, optimization, image transformation, US Post Office optical
character recognition and FBI fingerprint matching applications, we have concluded that a parallel processor
programmed with vector and array commands (like BLAS calls) would be effective. The
programming model must [illegible]:

o PMEs can be independent stored program processors,
o The array can have thousands of PMEs, and be suitable for fine grained parallelism,
o Inter-PME networks will have very high aggregate bandwidth and a small 'logical diameter',
o But, by network connected microprocessor MIMD standards, each PME is memory limited.

[illegible] processors has used task dispatching methodology. Such
approaches lead to each PME needing access to a portion of a large program. This characteristic, in
[illegible] of the [illegible] PME [illegible]
significant problem. We therefore target what we believe is a new programming model called
[illegible] 1991 of P. Kogge, which is incorporated herein.

A-SIMD programming in our APAP design means that a group of PMEs will be directed by commands
[illegible] SIMD models. The broadcast command [illegible] execution of a [illegible]
PME. The execution can involve data dependent branching and addressing within PMEs, and
I/O based synchronization with either other PMEs or the BCI.

Normally, PMEs will complete the processing and synchronize by reading the next command from the
[illegible] MIMD SIMD operating modes. Since the approach imposes no
[illegible] a PME operation [illegible] on data transfer,
[illegible] can be initiated. Such functions are very effective in data filtering, DSP and
[illegible] command [illegible]

[illegible] A SIMD mode [illegible] does not include MIMD mode
[illegible] PME native instructions. These instructions are
[illegible] decode [illegible] PME. Eliminating the PME instruction
[illegible] higher performance mode for tasks that do not involve data dependent branching.

[illegible] model (supported by H/W features) extends to permitting the array of PMEs to be
divided into independent sections. A separate A-SIMD command stream controls each section. Our
application studies show that programs of interest divide into separate phases (i.e. input, input buffering,
[illegible] parallelism results from applying the n PMEs in a section to a program phase. [illegible]
SIMD processing. We program the MIMD portions using conventional
[illegible] the remaining phases as A-SIMD sections, coded with vectorized commands.

[illegible] array [illegible] memory the program store. [illegible]

The programming model also considers allocating data elements to PMEs. The approach is to distribute
[illegible] of the [illegible] to be done by the programmer or by S/W.
We recognize that the IBM parallelizing compiler technologies apply to this problem and we expect to
investigate their usage. However, the inter-PME bandwidth provided does tend to reduce the importance of
this approach. This links data allocation and I/O mechanism performance.

The H/W requires that the PME initiate data transfers out of its memory, and it supports a controlled
write into PME memory without PME program involvement. Input control occurs in [illegible]
address and a maximum length. When I/O to a PME results [illegible]
to be developed [illegible] between adjacent PMEs or movement
[illegible] any PMEs. The last capability depends upon the circuit switched and store and
forward mechanisms. The [illegible] address and forward operation is important for performance. We have
optimized the H/W and S/W to support the operation. Using one word buffers results in an interrupt upon
receipt of address header. Comparing target ID with local ID permits output path selection. Transfer of the
subsequent data words occurs in circuit switched mode. A slight variation on this process using larger
buffers results in a store and forward mechanism.
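
The header handling just described (compare the target ID with the local ID, then select an output path) can be sketched for the simplest case of a single ring. The shortest-direction rule, the port labels and the function names are assumptions for illustration; the specification states only that the comparison permits output path selection.

```python
def select_output(local_id: int, target_id: int, ring_size: int) -> str:
    """Pick an output port from the address header (assumed shortest-way rule)."""
    if target_id == local_id:
        return "local"                      # header is for this PME: accept data
    forward = (target_id - local_id) % ring_size
    return "+X" if forward <= ring_size - forward else "-X"


def route(source: int, target: int, ring_size: int) -> list:
    """Follow the header hop by hop and return the PME IDs it visits."""
    path, node = [source], source
    while True:
        port = select_output(node, target, ring_size)
        if port == "local":
            return path
        node = (node + (1 if port == "+X" else -1)) % ring_size
        path.append(node)


print(route(0, 3, 8))
print(route(0, 6, 8))
```

Once the header has set up each hop, the data words that follow would use the same path without further comparison, which corresponds to the circuit switched transfer; buffering the whole message at each hop before forwarding would give the store and forward variant.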

Because of the high performance inter-PME bandwidth, it is not always necessary or desirable to place
data elements within the PME Array carefully. Consider shifting a vector data element distributed across
PMEs. Our architecture can send data without an address header, thus providing for very fast I/O. However,
we have found, in many applications, that optimizing a data structure for movement in one direction
penalizes data movement in an orthogonal direction. The penalty in such situations approximates the
average cost of randomly routing data in the network. This leads to applications where placing data
sequentially or randomly (as opposed to arranging data) results in shorter average process times.

Many applications can be synchronized to take advantage of average access time. (For example, PDE
relaxation processes acquire data from a neighborhood and thus can average access over at least four I/O
operations.) We believe that after considering the factors applicable to vector and array processes, like
scatter/gather or row/column arithmetic, many users will find brute force data allocation to be suitable for the
application. However, we know of examples that illustrate application characteristics (like required
synchronization or biased utilization of shift directions¹) that tend to force particular data allocation patterns. This
characteristic requires that the tools and techniques developed support either manual tuning of the data
placement, or simple and non-optimum data allocation. (We will support the non-optimum data allocation
strategy with host level macros to provide near transparent port of vectorized host programs to the MPP.
The H/W Monitor workstation will permit the user to investigate the resultant performance.)

FIGURE 10 shows the general S/W development and usage environment. The Host Application
Processor is optional in that program execution can be controlled from either the Host or the Monitor.
Further, the Monitor will effectively replace the Array Controller in some situations. The environment will
support program execution on real or simulated MPP hardware. The Monitor is scenario driven so that the
developer doing test and debug operations can create procedures to permit effective operation at any level
of abstraction.

FIGURE 20 illustrates the levels of H/W supported within the MPP and the user interfaces to these
levels.

We see two potential application programming techniques for the MPP. In the least programmer
intensive approach, applications would be written in a vectorized high order language. If the user did not
feel that the problem warranted tuning data placement, then he would use compile time services to allocate
data to the PME Array. The application would use vector calls like BLAS that would be passed to the
controller for interpretation and execution on the PME Array. Unique calls would be used to move data
between host and PME Array. In summary, the user would not need to be aware of how the MPP organized
or processed the data. Two optimization techniques will be supported for this type of application:

1. Altering the data allocation by constructing the data allocation table will permit programs to force data
placements.

2. Generation of additional vector commands for execution by the array controller will permit tuned
subfunctions (i.e. calling the Gaussian Elimination as a single operation).

We also see that the processor can be applied to specialized applications as in those referenced in the
beginning of this section. In such cases, code tuned to the application would be used. However, even in
such applications the degree of tuning will depend upon how important a particular task is to the application.
It is in this situation that we see the need for tasks individually suited to SIMD, MIMD or A-SIMD modes.
These programs will use a combination of:

1. Sequences of PME native instructions passed to an emulator function within the array controller. The
emulator will broadcast the instruction and its parameters to the PME set. The PMEs in this SIMD mode
will pass the instruction to the decode function, simulating a memory fetch operation.

2. Tight inner loops that can be I/O synchronized will use PME native ISA programs. After initiation from
a SIMD mode change, they would run continuously in MIMD mode. (The option to return to SIMD mode
via a 'RETURN' instruction exists.)

3. More complicated programs, as would be written in a vectorizing command set, would execute
subroutines in the array controller that invoked PME native functions. For example a simplified array
controller program to do a BLAS 'SAXPY' command on vectors loaded sequentially across PMEs would
start sequences within the PMEs that:

a. Enable PMEs with required x elements via comparison of PME ID with broadcast [illegible] and
[illegible] values,

b. Compress the x values via a write to consecutive PMEs,

c. Calculate the address of PMEs with y elements from broadcast data,

d. Transmit the compressed x data to the y PMEs,

e. Do a single precision floating point operation in PME, receiving x values to complete the operation.

¹ Gaussian Elimination with normal pivoting requires shifting rows but not columns. More than [illegible] performance
difference would result from arranging the data such that columns were on the fast shift direction. Even with
that there is not an advantage to arranging rows in any particular relationship to the PMEs.

Finally, the SAXPY example illustrates one additional aspect of executing vectorized application
programs. The major steps execute in the API and could be programmed by either an optimizer or product
developer. Normally, the vectorized application would call rather than include this level of code. These steps
would be written as C or Fortran code and will use memory mapped reads or writes to control the PME array
via the BCI bus. Such a program operates the PME array as a series of MIMD steps synchronized by
returns to the API program. Minor steps such as the single precision floating point routines would be
developed by the Customizer or Product Developer. These operations will be coded using the native PME
ISA and will be tuned to the machine characteristics. In general, this would be the domain of the Product
Developer since coding, test and optimization at this level require usage of the complete product
development tool set.
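
The five SAXPY steps (enable, compress, address, transmit, compute) can be traced in a sequential sketch. This is an illustration only: each PME is modeled as a dictionary, the parameter names `x_lo`, `x_hi` and `y_lo` are hypothetical stand-ins for the broadcast bounds, and Python floats are used rather than single precision.

```python
def saxpy_on_array(memory, a, x_lo, x_hi, y_lo):
    """Compute y := a*x + y with one vector element per simulated PME.

    memory[p] is a dict holding the 'x' or 'y' element resident in PME p.
    """
    # a. Enable the PMEs whose ID lies within the broadcast bounds.
    enabled = [p for p in range(len(memory)) if x_lo <= p <= x_hi]
    # b. Compress the x values into consecutive order.
    compressed = [memory[p]["x"] for p in enabled]
    # c. Calculate the addresses of the PMEs holding y elements.
    targets = [y_lo + i for i in range(len(compressed))]
    # d. Transmit the compressed x data to the y PMEs.
    for t, xv in zip(targets, compressed):
        memory[t]["inbox"] = xv
    # e. Each y PME completes the operation on its received x value.
    for t in targets:
        memory[t]["y"] = a * memory[t].pop("inbox") + memory[t]["y"]
    return memory


pme_memory = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}, {},
              {"y": 10.0}, {"y": 20.0}, {"y": 30.0}, {}]
saxpy_on_array(pme_memory, a=2.0, x_lo=0, x_hi=2, y_lo=4)
print([pme_memory[p]["y"] for p in (4, 5, 6)])
```

In the machine, steps a through d would run in the array under controller commands and step e would run in each PME in MIMD mode; the loops here only serialize what the PMEs would do concurrently.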

The APAP can have applications written in sequential Fortran. The path is quite different. FIGURE 21
outlines a Fortran compiler which can be used. In the first step, it uses a portion of the existing parallelizing
compiler to develop program dependencies. The source plus these tables become an input to a process
that uses a characterization of the APAP MMP and the source to enhance parallelism.

This MMP is a non-shared memory machine and as such allocates data between the PMEs for local
and global memory. The very fast data transfer times and the high network bandwidth reduce the time
effect of data allocation, but it still is addressed. Our approach treats part of memory as global and uses a
S/W service function. It is also possible to use the dependency information to perform the data allocation in
[illegible] converting the source to multiple sequential programs is performed
by the Level Partitioning step. This partitioning step is analogous to the Fortran [illegible] work being
conducted with DARPA funding. The last process in the compilation is generation of the executable code at
all individual functional levels. For the PME this will be done by programming the code generator on an
existing compiler system. The Host and API code compilers generate the code targeted to those machines.

The PME can execute MIMD software from its own memory. In general, the multiple PMEs would not
be executing totally different programs but rather would be executing the same small program in an
asynchronous manner. Three basic types of S/W can be considered, although the design approach does not
limit the APAP to just these approaches:

1. Specialized emulation functions would make the PME Array emulate the set of services provided by
standard user libraries like LINPACK or VPSS. In such an emulation package the PME Array would be
using its multiple set of devices to perform one of the operations required in a normal vector call. This
type of emulation, when attached to a vector processing unit, could utilize the vector unit for some
operations while performing others internally.

2. The parallelism of the PME Array could be exploited by operating a set of software that provides a
new set of mathematical and service functions in the PMEs. This set of primitives would be the codes
exploited by a customizing user to construct his application. The prior example of performing sensor
fusion on an APAP attached to a military platform would use such an approach. The customizer would
write routines to perform Kalman Filters, Track Optimum Assignment and Threat Assessment using the
supplied set of function names. This application would be a series of API call statements, and each call
would result in initiating the PME set to perform some basic operation like 'matrix multiply' on data
stored within the PME Array.

3. In cases where no effective method, considering performance objectives or application needs, exists,
then custom S/W could be developed and executed within the PME. A specific example is 'Sort'. Many
methods to sort data exist and the objective in all cases is to tune the process and the program to the
machine architecture. The modified hypercube is well suited to a Batcher Sort; however, that sort
requires extensive calculations to determine particular elements to compare versus very short
comparison cycles. The computer program in FIGURE 17 shows a simple example of a PME program [illegible] to
[illegible] with one element per PME. Each line of the program description would be
expanded to 3 to 6 PME machine level instructions, and all PMEs would then execute the program in
MIMD mode. Program synchronization is managed via the I/O statements. The program extends to more
data elements per PME and to very large parallel sorts in a quite straightforward manner.
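
As a rough illustration of a Batcher sort with one element per PME, the sketch below runs the bitonic compare-exchange schedule on a plain binary hypercube, where each PME's partner in a step differs from it in one address bit. It is not the FIGURE 17 program, which is not reproduced here; the function name is hypothetical, and the binary hypercube pairing is an assumption that may differ from the modified hypercube.

```python
def bitonic_sort_hypercube(values):
    """Batcher bitonic sort, one element per simulated PME.

    len(values) must be a power of two. In every step each PME exchanges
    its element with the PME whose ID differs in bit j and keeps either
    the smaller or the larger of the two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of PMEs must be a power of two")
    data = list(values)
    k = 2
    while k <= n:                      # size of the bitonic sequences merged
        j = k // 2
        while j >= 1:                  # dimension used in this exchange step
            new = list(data)
            for pme in range(n):       # all PMEs act concurrently in the array
                partner = pme ^ j
                ascending = (pme & k) == 0
                keep_low = ((pme & j) == 0) == ascending
                a, b = data[pme], data[partner]
                new[pme] = min(a, b) if keep_low else max(a, b)
            data = new
            j //= 2
        k *= 2
    return data


print(bitonic_sort_hypercube([5, 3, 8, 1, 7, 2, 6, 4]))
```

Each inner step is one exchange plus one comparison per PME, which matches the remark above that the comparison cycles are very short relative to the bookkeeping that decides which elements to compare.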
CC Storage Contents

Data from the CC storage is used by the PME Array in one of two manners. When the PMEs are
operating in SIMD, a series of instructions can be fetched by the CC and passed to the node BCI, thus
reducing load on both the API and CS. Alternatively, functions that are not frequently required, such as PME
Fault Reconfiguration S/W, PME Diagnostics, and perhaps conversion routines can be stored in the CC
memory. Such functions can then be requested by operating PME MIMD programs or moved to the PMEs
at the request of API program directives.

Packaging of the 8-Way Modified Hypercube

Our packaging techniques take advantage of the eight PMEs packaged in a single chip and arranged in
an N-dimensional modified hypercube configuration. This chip level package or node of the array is the
smallest building block in the APAP design. These nodes are then packaged in an 8 x 8 array where the
+-X and the +-Y make rings within the array or cluster and the +-W and +-Z are brought out to the
neighboring clusters. A grouping of clusters makes up an array. This step significantly cuts down wire count
for data and control for the array. The W and Z buses will connect to the adjacent clusters and form W and
Z rings to provide total connectivity around the completed array of various sizes. The massively parallel
system will be comprised of these cluster building blocks to form the massive array of PMEs. The APAP
will consist of an 8 x 8 array of clusters; each cluster will have its own controller and all the controllers will
be synchronized by our Array Director.

Many trade-offs of wireability and topology have been considered, yet with these considerations we
prefer the configuration which we illustrate with this connection. The concept disclosed has the advantage of
keeping the X and Y dimensions within a cluster level of packaging, and distributing the W and Z bus
connections to all the neighboring clusters.

After implementing the techniques described, the product will be wireable and manufacturable while
maintaining the inherent characteristics of the topology defined.

The concept used here is to mix, match, and modify topologies at different packaging levels to obtain
the desired results in terms of wire count. For the method to define the actual degree of modification of the
hypercube, refer to the Rolfe modified hypercube patent application referenced above. For the purpose of
this preferred embodiment we will describe two packaging levels to simplify our description. It can be
expanded.

The first is the chip design or chip package illustrated by FIGURE 3 and FIGURE 11. There are eight of
the processing elements with their associated memory and communication logic encompassed into a single
chip which is defined as a node. The internal configuration is classified as a binary hypercube or a 2-degree
hypercube where every PME is connected to two neighbors. See the PME-PME communication diagram in
FIGURE 9, especially 500, 510, 520, 530, 540, 550, 560, 570.

The second step is that the nodes are configured as an 8 x 8 array to make up a cluster. The fully
populated machine is built up of an array of 8 x 8 clusters to provide the maximum capacity of 32768
PMEs. These 4096 nodes are connected in an 8 degree modified hypercube network where the
communication between nodes is programmable. This ability to program different routing paths adds flexibility to
transmit different length messages. In addition to message length differences, there are algorithm
optimizations that can be addressed with these programmability features.

The packaging concept is intended to significantly reduce the off page wire count for each of the
clusters. The concept takes a cluster which is defined as an 8 x 8 array of nodes 820, each node 825 having
8 processing elements for a total of 512 PMEs, then limits the X and Y rings within the cluster and, finally,
brings out the W and Z buses to all clusters. See FIGURE 15 for a future packaging picture which
illustrates the full up packaging technique, limiting the X and Y rings 800 within the cluster and extending
out the W and Z buses to all clusters 810. The physical picture could be envisioned as a sphere
configuration of 64 smaller spheres 830.

The actual connection of a single node to the adjacent X and Y neighbors 975 exists within the same
cluster. The wiring savings occurs when the Z and W buses are extended to the adjacent neighboring
clusters as illustrated in FIGURE 10. Also illustrated in FIGURE 16 is the set of the chips or nodes that can
be configured as a sparsely connected 4-dimensional hypercube or torus 900, 905, 910, 915. Consider each
of the 8 external ports to be labeled as +X, +Y, +Z, +W, -X, -Y, -Z, -W 950, 975. Then, using m chips, a
ring can be constructed by connecting the +X to -X ports. Again m such rings can be interconnected into a
ring of rings by interconnecting the matching +Y to -Y ports. This level of structure will be called a cluster.
It provides for 512 PMEs and will be the building block for several size systems. Two such connections
(950, 975) are shown in the diagram illustrated in FIGURE 16.
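
The ring of rings construction can be written down as a neighbor table. This is a minimal sketch assuming each node in a cluster is addressed by a coordinate pair (x, y) with m = 8; the function name and the coordinate convention are illustrative and not taken from the figures.

```python
def cluster_neighbors(x: int, y: int, m: int = 8) -> dict:
    """Nodes reached from node (x, y) over its +X, -X, +Y and -Y ports.

    Connecting +X to -X closes m nodes into a ring; connecting +Y to -Y
    closes m such rings into a ring of rings, so both indices wrap modulo m.
    """
    return {
        "+X": ((x + 1) % m, y),
        "-X": ((x - 1) % m, y),
        "+Y": (x, (y + 1) % m),
        "-Y": (x, (y - 1) % m),
    }


M = 8
PMES_PER_NODE = 8
print(cluster_neighbors(7, 0))
print(M * M * PMES_PER_NODE)
```

Only the X and Y ports appear because those rings stay inside the cluster; the W and Z ports would repeat the same wrap-around pattern between clusters.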

Agptajfare for Deskside MPP. 

The Deskside MPP in a workstation can be effectively applied in several application areas including:
1. Small production tasks that depend upon compute intensive processes. The US Postal Service
requires a processor that can accept a fax image of a machine printed envelope and then find and read
[illegible]. [illegible] a very repetitive but still compute intensive process. We have implemented
APL language versions of a sample of the required programs. These models emulate the vector and array
processes that will be used to do the work on an MPP. Based upon this test, we know that the task is an
excellent match to the processing [illegible].

2. Tasks in which an analyst, as a result of prior output or expected needs, requests sequences of data
transformations. In an example taken from the Defense Mapping Agency, satellite images are to be
transformed and smoothed pixel by pixel into some other coordinate system. In such a situation, the
transformation parameters for the image vary across localities as a result of ground elevation and slope.
The analyst must therefore add fixed control points and reprocess transformations. A similar need occurs
[illegible].

3. Program development for production versions of the MPP will use workstation size MPPs. Consider a
tuning process that requires analysis of processor versus network performance. Such a task is machine
and analyst interactive. It can require hours when the machine is idle and the analyst is working. When
performed on a supercomputer it would be very costly. However, providing [illegible]
the test and debug process by eliminating the programmer inefficiencies related to accessing remote
processors.

[illegible] workstation accelerator. It uses the same size enclosure as the
[illegible] floating point [illegible] 530 Mflops of processing power and about 100
Mbytes of I/O bandwidth to the array. The unit would be suitable for any of the prior applications. With
[illegible] it would be price comparable to high performance
workstations, not at the price of comparable machines employing old technology.

Description of the AWACS Sensor Fusion

The military environment provides a series of examples showing the need for a hardened compute
intensive processor.

[illegible] the need for digitally encoded communications
[illegible]. The process of encoding the data for transmission and recovering
[illegible] can be done with specialized signal
processing modules, but for situations where communication encoding represents bursts of activity,
specialized modules are mostly idle. Using the MPP permits several such tasks to be allocated to a single
module and saves weight, power, volume and cost.

Sensor data fusion presents a particularly clear example of enhancing an existing platform with the
compute power gained from the addition of an MPP. On the Air Force E3 AWACS there are more than four
sensors on the platform, but there is currently no way to generate tracks resulting from the integration of all
available data. Rather, the existing generated tracks have quite poor quality due to sampling characteristics.
Therefore, there is motivation to use fusion to provide an effective higher sample rate.

[illegible] has studied this sensor fusion problem in detail and can propose a verifiable and effective solution,
but that solution would overwhelm the compute power available in an AWACS data processor. FIGURE 23
shows the traditional track fusion process. The process is faulty because errors in the individual processes
[illegible] them instead of eliminating them. The
[illegible] presents an improvement and the resulting compute intensive problem with
the approach. Although we cannot solve an NP-hard problem, we have developed a good method to
[illegible] elsewhere, as it can be employed on a variety of machines like an Intel Touchstone with (80860)






processors, and IBM's Scientific Visualization System. It can be used as an application suitable for the MPP
using the APAP design described here with, say, 128,000 PMEs, substantially outperforming these other
systems. Application experiments show the approximation quality is below the level of sensor noise and as
such the answer is applicable to applications like AWACS. FIGURE 25 shows the processing loop involved
in the proposed Lagrangean Reduction n-dimensional Assignment algorithm. The problem uses very
controlled repetitions of the well known 2-dimensional assignment problem, the same algorithm that
classical sensor fusion processing uses.

Suppose for example that the n-dimensional algorithm were to be applied to the seven sets of
observations illustrated in FIGURE 24 and further, suppose that each pass through a reduction process
required four iterations through a 2d Assignment process. Then the new nd Assignment Problem would
require 4000 iterations of the 2d Assignment Problem. The AWACS' workload is now about 90% of machine
capacity. Fusion perhaps requires 10% of the total effort but even that small effort when scaled up 4000
times results in total utilization being 370 times the capacity of an AWACS. Not only does this workload
overwhelm the existing processor, but it would be marginal in any new MIL environment suited, coarse-
grained, parallel processing system currently existing or anticipated in the next few years. If the algorithm
required an average of 5 rather than 4 iterations per step, then it would overwhelm even the hypothesized
systems. Conversely, the MPP solution can provide the compute power and can do so even at the 5
iteration level.
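
The figures in this paragraph can be checked with a short calculation. The sketch below rests on two assumptions that the text does not state: that seven observation sets imply six nested reduction levels, so that the iteration count is the per-pass count raised to the sixth power (4^6 = 4096, about 4000), and that the quoted 370 is the product 90% x 10% x 4096. The function name and parameters are hypothetical.

```python
def assignment_workload(sets, iterations_per_pass,
                        current_load=0.90, fusion_share=0.10):
    """Estimate the scale-up of the nd assignment problem.

    Assumption: one nested reduction level per additional observation set,
    so the number of 2d assignment iterations grows geometrically.
    Returns (number of 2d iterations, load as a multiple of AWACS capacity).
    """
    levels = sets - 1
    scale = iterations_per_pass ** levels
    load = current_load * fusion_share * scale
    return scale, load


scale4, load4 = assignment_workload(7, 4)   # four iterations per pass
scale5, load5 = assignment_workload(7, 5)   # five iterations per pass
```

Under these assumptions the four-iteration case gives 4096 iterations and a load of roughly 369 times capacity, while the five-iteration case gives 15625 iterations, nearly four times more work.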

Mechanical Packaging

As illustrated in FIGURE 3 and other FIGURES, our preferred chip is configured in a quadflatpack form.
As such it can be brickwalled into various 2-D and 3-D configurations in a package. One chip of eight or
more processor memory elements is a first level package module, the same as a single DRAM memory
chip is to a foundry which packages the chip. However, it is in a quadflatpack form, allowing connections to
one another in four directions. Each connection is point to point. (One chip in its first level package is a
module to the foundry.) We are able to construct PC arrays of sufficient magnitude to hit our performance
goals due to this feature. The reality is that you can connect these chips across 3, 4 or even five feet point-
to-point, i.e. multi-processor node to node, and still have proper control without the need of fiber optics.

This has an advantage for the drive/receive circuits that are required on the modules. One can achieve
high performance and keep the power dissipation down because we do not have bus systems that daisy
chain from module to module. We broadcast from node to node, but this need not be a high performance
path. Most data operations can be conducted in a node, so data path requirements are reduced. Our
broadcast path is used primarily as a controller routing tool. The data stream attaches to and
runs in the ZWXY communication path system.

Our power dissipation is 2.2 watts per node module for our commercial workstation. This allows us to
use air cooled packaging. The power system requirements for our system are also reasonable because of
this fact. Our power system illustrated multiplies the number of modules supported by about 2.5 watts per
module, and such a five volt power supply is very cost effective. Those concerned with the amount of
electricity consumed would be astonished that 32 microcomputers could operate with less than the wattage
consumed by a reading light.

Our thermal design is enhanced because of the packaging. We avoid hot spots due to high dissipating
parts mixed with low dissipating ones. This reflects directly on the cost of the assemblies.

The cost of our system is very attractive compared to the approaches that put a superscalar processor
on a card. Our performance level per assembly per watt per connector per part type per dollar is excellent.
Furthermore, we do not need the same number of packaging levels that the other technology does. We
do not need module/card/backplane and cable. We can skip the card level if we want to. As illustrated in our
workstation modules, we have skipped the card level with our brickwalled approach.

Furthermore, as we illustrated in our layout, each node housing which is brickwalled in the workstation
modules can, as illustrated in FIGURE 3, comprise multiple replicated dies, even within the same chip
housing. While normally we would place one die within an air cooled package, it is possible to place 8 die
on a substrate using a multiple chip module approach. Thus, the envisioned watch with 32 or more
processors is possible, as well as many other applications. The packaging, power and flexibility make
possible applications which are endless. A house could have its controllable instruments all watched and
coordinated with a very small part. Those many chips that are spread around an automobile for engine watching,
brake adjustment and so on could all have a monitor within a housing. In addition, on the same substrate
with hybrid technology, one could mount a 386 microprocessor chip with full programmable capability and
memory (all in one chip) and use it as the array controller for the substrate package.

We have shown many configurations of systems, from control systems, FIGURE 3, to larger and larger
systems. The ability to package a chip with multiple processor memory elements, eight or more on a chip,
in a DIP, with pinouts fitting in a standard DRAM memory module, such as in a SIMM module, makes possible
countless additional applications ranging from controls to wall size video displays which can have a
repetition rate, not at the 15 or so frames that press the existing technology today, but at [illegible] frames, with a
processor assigned to monitor a pixel, or a node only a few pixels. Our brickwall quadflatpack makes it easy
to replicate the same part over and over again. Furthermore, the replicated processor is really memory
with processor interchange. Part of the memory can be assigned to a specific monitoring task, and another
part (with a size programmatically defined) can be a massive global memory, addressed point-to-point, with
broadcast to all capability.

Our basic workstation, our supercomputer, our controller, our AWACS, all are examples of packages
that can employ our new technology. An array of memory, with inbuilt CPU chips and I/O, functions as a
PME for massively parallel applications, and even more limited applications. The flexibility of packaging and
programming makes imaginations expand and our technology allows one part to be assigned to many ideas
and images.

Military Avionics Application

The cost advantage of constructing a MIL MPP is particularly well illustrated by the AWACS. It is a 20
year old enclosure that has grown empty space as new technology memory modules have replaced the
original core memories. FIGURE 26 shows a MIL qualifiable two cluster system that would fit directly into
the rack's empty space and would use the existing memory bus system for interconnection.

Although the AWACS example is very advantageous due to the existence of empty space, in other
systems it is possible to create space. Replacing existing memory with a small MPP or gateway to an
isolated MPP is normally quite viable. In such cases, a quarter cluster and an adapter module would result in
an 8 Megabyte memory plus 640 MIPS and use perhaps two slots.

Supercomputer Application

A 64 cluster MPP is a 13.6 Gflop supercomputer. It can be configured in a system described in FIGURE 27.

The system we describe allows node chips to be brickwalled on cluster cards as illustrated in FIGURE 27
to build up systems with some significant cost and size advantages. There is no need to include extra chips
such as a network switch in such a system because it would increase costs.

Our interconnection system with "brickwalled" chips allows systems to be built like massive DRAM
memory is packaged and will have a defined bus adapter conforming to the rigid bus specifications, for
instance a Microchannel bus adapter. Each system will have a smaller power supply system and cooling
design than other systems based upon many modern microprocessors.

Unlike most supercomputers, our current preferred APAP with floating point emulation is much faster in
integer arithmetic (164 GIPS) than it is when doing floating point arithmetic. As such, the processor would
be most effective when used in applications that are very character or integer intensive. We have
considered three program challenges which, in addition to the other applications discussed herein, are
needful of solution. The applications, which may be more important than some of the "grand challenges" to
day to day life, include:

1. 3090 Vector Processors contain a very high performance floating point arithmetic unit. That unit, as do
most vectorized floating point units, requires pipeline operations on dense vectors. Applications that
make extensive use of non-regular sparse matrices (i.e. matrices described by bit maps rather than
diagonals) waste the performance capability of the floating point unit. The MPP solves this problem by
providing the storage for the data and using its compute power and network bandwidth, not to do the
calculation but rather to construct dense vectors, and to decompress dense results. The Vector
Processing Unit is kept busy by a continual flow of operations on dense vectors being supplied to it by
the MPP. By sizing the MPP so that it can effectively compress and decompress at the same rate the
Vector Facility processes, one could keep both units fully busy.

2. Another host attached system we considered is a solution to the FBI fingerprint matching problem.
Here, a machine with more than 64 clusters was considered. The problem was to match about 9000
fingerprints per hour against the entire database of fingerprint history. Using massive DASD and the full
bandwidth of the MPP to host attachment, one can roll the complete data base across the incoming
prints in about 20 minutes. Operating about 75% of the MPP in a SIMD mode coarse matching
operation balances processing to required throughput rate. We estimate that 15% of the machine in A-
SIMD processing mode would then complete the matching by doing the detailed verification of unknown
print versus file print for cases passing the coarse filter operation. The remaining portions of the machine
were in MIMD mode and allocated to reserve capacity, work queue management and output formatting.
3. Application of the MPP to database operations has been considered. Although the work is very
preliminary, it does seem to be a good match. Two aspects of the MPP support this premise:

a. The connection between a cluster controller and the Application Processor Interface is a Micro-
Channel. As such, it could be populated with DASD dedicated to the cluster and accessed directly
from the cluster. A 64 cluster system with six 640 Mbyte hard drives attached per cluster would
provide 246 Gbyte storage. Further, that entire database could be searched sequentially in 10 to 20
seconds.

b. Databases are generally not searched sequentially. Instead they use many levels of pointers.
Indexing of databases can be done within the cluster. Each bank of DASD would be supported by 2.5
GIPS of processing power and 32 Mbyte of storage. That is sufficient for both searching and storing
the indices. Since indices are now frequently stored within the DASD, significant performance gains
would occur. Using such an approach and dispersing DASD on SCSI interfaces attached to the cluster
MicroChannel permits effectively unlimited size data bases.
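
The compress and decompress steps described in item 1 (building dense vectors from bit-map-described sparse data, then expanding dense results) can be sketched as a gather and a scatter. This is a minimal single-process illustration, not the MPP implementation; the function names and the fill value are hypothetical.

```python
def compress(values, bitmap):
    """Gather the elements flagged by the bit map into a dense vector."""
    return [v for v, keep in zip(values, bitmap) if keep]


def decompress(dense, bitmap, fill=0.0):
    """Scatter a dense result back to the positions flagged by the bit map."""
    remaining = iter(dense)
    return [next(remaining) if keep else fill for keep in bitmap]


# A sparse row and the bit map that marks its populated positions.
row = [0.0, 3.0, 0.0, 5.0]
bitmap = [0, 1, 0, 1]
dense = compress(row, bitmap)              # handed to the vector unit
doubled = [2.0 * v for v in dense]         # stand-in for the dense vector operation
result = decompress(doubled, bitmap)       # expanded back to sparse layout
```

In this sketch the vector unit only ever sees the two populated elements, which is the point of having the MPP do the packing and unpacking.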
FIGURE 27 is an illustration of the APAP when used to build the system into a supercomputer scaled
MPP. The approach reverts to replicating units, but here it is enclosures containing 16 clusters that are
replicated. The particular advantage of this replication approach is that the system can be scaled to suit the
user's needs.
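
The sizing figures quoted in the applications above follow from a few per-PME constants. The sketch below assumes 512 PMEs per cluster, 5 MIPS per PME and 32K 16 bit words of memory per PME, all taken from elsewhere in this specification; the drive count and drive size are a reading of a poorly legible figure that happens to reproduce the quoted 246 Gbyte. The function and constant names are hypothetical.

```python
PMES_PER_CLUSTER = 512
MIPS_PER_PME = 5
WORDS_PER_PME = 32 * 1024   # 32K words of 16 bits
BYTES_PER_WORD = 2


def mpp_capacity(clusters, drives_per_cluster=6, mbytes_per_drive=640):
    """Return processing, memory and DASD capacity for a number of clusters.

    'clusters' may be fractional (0.25 for a quarter cluster).
    """
    pmes = int(clusters * PMES_PER_CLUSTER)
    return {
        "pmes": pmes,
        "mips": pmes * MIPS_PER_PME,
        "memory_mbytes": pmes * WORDS_PER_PME * BYTES_PER_WORD / 2**20,
        "dasd_gbytes": clusters * drives_per_cluster * mbytes_per_drive / 1000,
    }


quarter = mpp_capacity(0.25)   # 128 PMEs: 640 MIPS, 8 Mbyte
single = mpp_capacity(1)       # 512 PMEs: 2560 MIPS, 32 Mbyte
full = mpp_capacity(64)        # 32768 PMEs: about 164 GIPS, about 246 Gbyte of DASD
```

The quarter cluster, single cluster and 64 cluster results line up with the 640 MIPS and 8 Megabyte, the 2.5 GIPS and 32 Mbyte, and the 164 GIPS and 246 Gbyte figures given in the text.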

System Architecture 

An advantage of the system architecture which is employed in the current preferred embodiment is the
ISA system which will be understood by many who will form a pool for programming the APAP. The PME
ISA consists of the following Data and Instruction Formats, illustrated in the Tables.

Data Formats 

The basic (operand) size is the 16 bit word. In PME storage, operands are located on integral word
boundaries. In addition to the word operand size, other operand sizes are available in multiples of 16 bits to
support additional functions.

Within any of the operand lengths, the bit positions of the operand are consecutively numbered from
left to right starting with the number 0. References to high-order or most-significant bits always refer to the
left-most bit positions. References to the low-order or least-significant bits always refer to the right-most bit
positions.

Instruction Formats

The length of an instruction format may either be 16 bits or 32 bits. In PME storage, instructions must
be located on a 16 bit boundary.

The following general instruction formats are used. Normally, the first four bits of an instruction define
the operation code and are referred to as the OP bits. In some cases, additional bits are required to extend
the definition of the operation or to define unique conditions which apply to the instruction. These bits are
referred to as OPX bits.



Format Code    Operation
RR             Register to Register
DA             Direct Address
RS             Register Storage
RI             Register Immediate
SS             Storage to Storage
SPC            Special



All formats have one field in common. This field and its interpretation is:

Bits 0-3      Operation Code - This field, sometimes in conjunction with an operation code extension
              field, defines the operation to be performed.

Detailed figures of the individual formats along with interpretations of their fields are provided in the
following subsections. For some instructions, two formats may be combined to form variations on the
instruction. These primarily involve the addressing mode for the instruction. As an example, a storage to
storage instruction may have a form which involves direct addressing or register addressing.

RR Format 



The Register-Register (RR) format provides two general register addresses and is 16 bits in length as
shown.

+--------+--------+---------+--------+
|   OP   |   RA   | 0 0 0 0 |   RB   |
+--------+--------+---------+--------+
 0      3 4      7 8      11 12    15

In addition to an Operation Code field, the RR format contains:

Bits 4-7      Register Address 1 - The RA field is used to specify which of the 16 general registers is
              to be used as an operand and/or destination.
Bits 8-11     Zeros - Bit 8 being a zero defines the format to be a RR or DA format and bits 9-11
              equal to zero define the operation to be a register to register operation (a special case of
              the Direct Address format).
Bits 12-15    Register Address 2 - The RB field is used to specify which of the 16 general registers is
              to be used as an operand.

DA Format

The Direct Address (DA) format provides one general register address and one direct storage address
as shown.

+--------+--------+---+--------------+
|   OP   |   RA   | 0 |   DIR ADDR   |
+--------+--------+---+--------------+
 0      3 4      7  8  9           15

In addition to an Operation Code field, the DA format contains:

Bits 4-7      Register Address 1 - The RA field is used to specify which of the 16 general registers is
              to be used as an operand and/or destination.
Bit 8         Zero - This bit being zero defines the operation to be a direct address operation or a
              register to register operation.
Bits 9-15     Direct Storage Address - The Direct Storage Address field is used as an address into
              the level unique storage block or the common storage block. Bits 9-11 of the direct
              address field must be non-zero to define the direct address form.



RS Format 

The Register Storage (RS) format provides one general register address and an indirect storage
address.

+--------+--------+---+-------+--------+
|   OP   |   RA   | 1 |  DEL  |   RB   |
+--------+--------+---+-------+--------+
 0      3 4      7  8  9    11 12    15

In addition to an Operation Code field, the RS format contains:

Bits 4-7      Register Address 1 - The RA field is used to specify which of the 16 general registers is
              to be used as an operand and/or destination.
Bit 8         One - This bit being one defines the operation to be a register storage operation.
Bits 9-11     Register Data - These bits are considered a signed value which is used to modify the
              contents of the register specified by the RB field.
Bits 12-15    Register Address 2 - The RB field is used to specify which of the 16 general registers is
              to be used as a storage address for an operand.

RI Format

The Register-Immediate (RI) format provides one general register address and 16 bits of immediate
data. The RI format is 32 bits in length as shown:

+--------+--------+---+-------+---------+--------------------+
|   OP   |   RA   | 1 |  DEL  | 0 0 0 0 |   IMMEDIATE DATA   |
+--------+--------+---+-------+---------+--------------------+
 0      3 4      7  8  9    11 12     15 16                31

In addition to an Operation Code field, the RI format contains:

Bits 4-7      Register Address 1 - The RA field is used to specify which of the 16 general registers is
              to be used as an operand and/or destination.
Bit 8         One - This bit being one defines the operation to be a register storage operation.
Bits 9-11     Register Data - These bits are considered a signed value which is used to modify the
              [illegible].
Bits 12-15    Zeroes - The field being zero is used to specify that the updated program counter, which
              points to the immediate data field, is to be used as a storage address for an operand.
Bits 16-31    [illegible]
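
The RR, DA, RS and RI layouts can be told apart from bit 8, bits 9-11 and bits 12-15 alone. The following is a minimal sketch of such a decoder. It assumes two's complement for the signed 3 bit delta and treats an all-zero bits 12-15 field with bit 8 set as the RI form, which is a reading of the partly legible RI field list. The names used are hypothetical, operation code values are not interpreted, and the SS and SPC formats, which share these layouts and depend on the operation code, are not handled.

```python
from dataclasses import dataclass
from typing import Optional


def field(word: int, first: int, last: int, width: int = 16) -> int:
    """Extract bits first..last, numbered left to right with bit 0 most significant."""
    length = last - first + 1
    shift = width - 1 - last
    return (word >> shift) & ((1 << length) - 1)


def signed3(value: int) -> int:
    """Interpret a 3 bit field as two's complement (an assumption)."""
    return value - 8 if value & 0b100 else value


@dataclass
class Decoded:
    fmt: str
    op: int
    ra: int
    rb: Optional[int] = None
    delta: Optional[int] = None
    direct_address: Optional[int] = None


def decode(word: int) -> Decoded:
    """Classify the first 16 bit word of an instruction as RR, DA, RS or RI."""
    op = field(word, 0, 3)
    ra = field(word, 4, 7)
    if field(word, 8, 8) == 0:
        if field(word, 9, 11) == 0:
            return Decoded("RR", op, ra, rb=field(word, 12, 15))
        return Decoded("DA", op, ra, direct_address=field(word, 9, 15))
    rb = field(word, 12, 15)
    delta = signed3(field(word, 9, 11))
    # Assumed reading: bits 12-15 all zero select the updated program counter (RI).
    return Decoded("RI" if rb == 0 else "RS", op, ra, rb=rb, delta=delta)
```

For example, decode(0x1203) is classified as RR with RA = 2 and RB = 3, while decode(0x12F5) is classified as RS with a delta of -1 applied to register 5.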



SS Format

The Storage to Storage (SS) format provides two storage operand addresses, one explicit and the second
implicit [illegible]. Register 1 is modified during [illegible].

+--------+---------+---+--------------+
|   OP   |   OPX   | 0 |   DIR ADDR   |     (Direct Address Form)
+--------+---------+---+--------------+
 0      3 4       7  8  9           15

+--------+---------+---+-------+--------+
|   OP   |   OPX   | 1 |  DEL  |   RB   |   (Storage Address Form)
+--------+---------+---+-------+--------+
 0      3 4       7  8  9    11 12    15

In addition to an Operation Code field, the SS format contains:

Bits 4-7      Operation Extension Code - The OPX field, together with the Operation Code, defines
              the operation to be performed. Bits 4-5 define the operation type such as ADD or
              SUBTRACT. Bits 6-7 control the carry, overflow, and how the condition code will be set.
              Bit 6 = 0 ignores overflow; bit 6 = 1 allows overflow. Bit 7 = 0 ignores the carry stat
              during the operation; bit 7 = 1 includes the carry stat during the operation.
Bit 8         Zero - Defines the form to be a direct address form. One - Defines the form to be a
              storage address form.
Bits 9-15     Direct Address (Direct Address Form) - The Direct Storage Address field is used as
              an address into the level unique storage block or the common storage block. Bits 9-11 of
              the direct address field must be non-zero to define the direct address form.
Bits 9-11     Register Delta (Storage Address Form) - These bits are considered a signed value
              which is used to modify the contents of the register specified by the RB field.
Bits 12-15    Register Address 2 (Storage Address Form) - The RB field is used to specify which
              of the 16 general registers is to be used as a storage address for an operand.



SPC Format 1 

The Special (SPC1) format provides one general register storage operand address.

+--------+---------+---+-------+--------+
|   OP   |   OPX   | 0 |  LEN  |        |   (RR Form)
+--------+---------+---+-------+--------+
 0      3 4       7  8  9    11 12    15

+--------+---------+---+-------+--------+
|   OP   |   OPX   | 1 |  LEN  |   RB   |   (RS Form)
+--------+---------+---+-------+--------+
 0      3 4       7  8  9    11 12    15

In addition to an Operation Code field, the SPC1 format contains:

Bits 4-7      OP Extension - The OPX field is used to extend the operation code.
Bit 8         Zero or One - This bit being zero defines the operation to be a register operation. This
              bit being one defines the operation to be a register storage operation.
Bits 9-11     Operation Length - These bits are considered an unsigned value which is used to
              specify the length of the operand in 16 bit words. A value of zero corresponds to a length
              of one, and a value of 7 corresponds to a length of eight.
Bits 12-15    Register Address 2 - The RB field is used to specify which of the 16 general registers is
              to be used as a storage address for the operand.

SPC Format 2

The Special (SPC2) format provides one general register storage operand address.






+--------+--------+---------+--------+
|   OP   |   RA   |   OPX   |   RB   |
+--------+--------+---------+--------+
 0      3 4      7 8      11 12    15

In addition to an Operation Code field, the SPC2 format contains:

Bits 4-7      Register Address 1 - The RA field is used to specify which of the 16 general registers is
              to be used as an operand and/or destination.
Bits 8-11     OP Extension - The OPX field is used to extend the operation code.
Bits 12-15    Register Address 2 - The RB field is used to specify which of the 16 general registers is
              to be used as a storage address for the operand.

THE INSTRUCTION LIST OF THE ISA INCLUDES THE FOLLOWING:

Table 1. Fixed-Point Arithmetic Instructions

NAME                                     MNEMONIC      TYPE
ADD DIRECT                               ada           DA
ADD FROM STORAGE                         a             RS
  (WITH DELTA)                           awd           RS
ADD IMMEDIATE                            ai            RI
  (WITH DELTA)                           aiwd          RI
ADD REGISTER                             ar            RR
COMPARE DIRECT ADDRESS                   cda           DA
COMPARE IMMEDIATE                        ci            RI
  (WITH DELTA)                           ciwd          RI
COMPARE FROM STORAGE                     c             RS
  (WITH DELTA)                           cwd           RS
COMPARE REGISTER                         cr            RR
COPY                                     [illegible]   RS
  (WITH DELTA)                           [illegible]   RS
COPY WITH BOTH IMMEDIATE                 cpybi         RI
  (WITH DELTA)                           cpybiwd       RI
COPY IMMEDIATE                           cpyi          RI
  (WITH DELTA)                           cpyiwd        RI
COPY DIRECT                              cpyda         DA
COPY DIRECT IMMEDIATE                    cpydai        DA
INCREMENT                                inc           RS
  (WITH DELTA)                           incwd         RS
LOAD DIRECT                              lda           DA
LOAD FROM STORAGE                        l             RS
  (WITH DELTA)                           lwd           RS
LOAD IMMEDIATE                           li            RI
  (WITH DELTA)                           liwd          RI
LOAD REGISTER                            lr            RR
MULTIPLY SIGNED                          mpy           SPC
MULTIPLY SIGNED EXTENDED                 mpyx          SPC
MULTIPLY SIGNED EXTENDED IMMEDIATE       mpyxi         SPC
MULTIPLY SIGNED IMMEDIATE                [illegible]   SPC
MULTIPLY UNSIGNED                        [illegible]   SPC
MULTIPLY UNSIGNED EXTENDED               [illegible]   SPC
MULTIPLY UNSIGNED EXTENDED IMMEDIATE     [illegible]   SPC
MULTIPLY UNSIGNED IMMEDIATE              [illegible]   SPC
STORE DIRECT                             [illegible]   DA
STORE                                    [illegible]   RS
  (WITH DELTA)                           [illegible]   RS
STORE IMMEDIATE                          [illegible]   RI
  (WITH DELTA)                           [illegible]   RI
SUBTRACT DIRECT                          [illegible]   DA
SUBTRACT FROM STORAGE                    [illegible]   RS
  (WITH DELTA)                           [illegible]   RS
SUBTRACT IMMEDIATE                       [illegible]   RI
  (WITH DELTA)                           [illegible]   RI
SUBTRACT REGISTER                        [illegible]   RR
SWAP AND EXCLUSIVE OR WITH STORAGE       [illegible]   RR



Table 2. Storage to Storage Instructions

NAME                                              MNEMONIC      TYPE
ADD STORAGE TO STORAGE                            sa            SS
  (WITH DELTA)                                    sawd          SS
ADD STORAGE TO STORAGE DIRECT                     sada          SS
ADD STORAGE TO STORAGE FINAL                      saf           SS
  (WITH DELTA)                                    safwd         SS
ADD STORAGE TO STORAGE FINAL DIRECT               safda         SS
ADD STORAGE TO STORAGE INTERMEDIATE               sai           SS
  (WITH DELTA)                                    [illegible]   SS
ADD STORAGE TO STORAGE INTERMEDIATE DIRECT        saida         SS
ADD STORAGE TO STORAGE LOGICAL                    sal           SS
  (WITH DELTA)                                    salwd         SS
ADD STORAGE TO STORAGE LOGICAL DIRECT             [illegible]   SS
COMPARE STORAGE TO STORAGE                        sc            SS
  (WITH DELTA)                                    [illegible]   SS
COMPARE STORAGE TO STORAGE DIRECT                 scda          SS
COMPARE STORAGE TO STORAGE FINAL                  [illegible]   SS
  (WITH DELTA)                                    scfwd         SS
COMPARE STORAGE TO STORAGE INTERMEDIATE           [illegible]   SS
  (WITH DELTA)                                    sciwd         SS
COMPARE STORAGE TO STORAGE INTERMEDIATE DIRECT    scida         SS
COMPARE STORAGE TO STORAGE LOGICAL                scl           SS
  (WITH DELTA)                                    sclwd         SS
COMPARE STORAGE TO STORAGE LOGICAL DIRECT         sclda         SS
MOVE STORAGE TO STORAGE                           smov          SS
  (WITH DELTA)                                    smovwd        SS
MOVE STORAGE TO STORAGE DIRECT                    smovda        SS
SUBTRACT STORAGE TO STORAGE                       ss            SS
  (WITH DELTA)                                    sswd          SS
SUBTRACT STORAGE TO STORAGE DIRECT                ssda          SS
SUBTRACT STORAGE TO STORAGE FINAL                 [illegible]   SS
  (WITH DELTA)                                    ssfwd         SS
SUBTRACT STORAGE TO STORAGE FINAL DIRECT          ssfda         SS
SUBTRACT STORAGE TO STORAGE INTERMEDIATE          ssi           SS
  (WITH DELTA)                                    ssiwd         SS
SUBTRACT STORAGE TO STORAGE INTERMEDIATE DIRECT   ssida         SS
SUBTRACT STORAGE TO STORAGE LOGICAL               ssl           SS
  (WITH DELTA)                                    sslwd         SS
SUBTRACT STORAGE TO STORAGE LOGICAL DIRECT        sslda         SS

Table 3. Logical Instructions

NAME                                              MNEMONIC      TYPE
AND DIRECT ADDRESS                                nda           DA
AND FROM STORAGE                                  n             RS
  (WITH DELTA)                                    nwd           RS
AND IMMEDIATE                                     ni            RI
  (WITH DELTA)                                    niwd          RI
AND REGISTER                                      nr            RR
OR DIRECT ADDRESS                                 [illegible]   DA
OR FROM STORAGE                                   [illegible]   RS
  (WITH DELTA)                                    owd           RS
OR IMMEDIATE                                      oi            RI
  (WITH DELTA)                                    oiwd          RI
OR REGISTER                                       or            RR
XOR DIRECT ADDRESS                                [illegible]   DA
XOR FROM STORAGE                                  x             RS
  (WITH DELTA)                                    [illegible]   RS
XOR IMMEDIATE                                     [illegible]   RI
  (WITH DELTA)                                    [illegible]   RI
XOR REGISTER                                      xr            RR

Table 4. Shift Instructions

NAME                                              MNEMONIC      TYPE
SCALE BINARY                                      scale         SPC
SCALE BINARY IMMEDIATE                            [illegible]   SPC
SCALE BINARY REGISTER                             scaler        SPC
SCALE HEXADECIMAL                                 scaleh        SPC
SCALE HEXADECIMAL IMMEDIATE                       scalehi       SPC
SCALE HEXADECIMAL REGISTER                        [illegible]   SPC
SHIFT LEFT ARITHMETIC BINARY                      sla           SPC
SHIFT LEFT ARITHMETIC BINARY IMMEDIATE            slai          SPC
SHIFT LEFT ARITHMETIC BINARY REGISTER             slar          SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL                 slah          SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL IMMEDIATE       slahi         SPC
SHIFT LEFT ARITHMETIC HEXADECIMAL REGISTER        slahr         SPC
SHIFT LEFT LOGICAL BINARY                         sll           SPC
SHIFT LEFT LOGICAL BINARY IMMEDIATE               slli          SPC
SHIFT LEFT LOGICAL BINARY REGISTER                sllr          SPC
SHIFT LEFT LOGICAL HEXADECIMAL                    sllh          SPC
SHIFT LEFT LOGICAL HEXADECIMAL IMMEDIATE          sllhi         SPC
SHIFT LEFT LOGICAL HEXADECIMAL REGISTER           sllhr         SPC
SHIFT RIGHT ARITHMETIC BINARY                     sra           SPC
SHIFT RIGHT ARITHMETIC BINARY IMMEDIATE           srai          SPC
SHIFT RIGHT ARITHMETIC BINARY REGISTER            srar          SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL                srah          SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL IMMEDIATE      srahi         SPC
SHIFT RIGHT ARITHMETIC HEXADECIMAL REGISTER       srahr         SPC
SHIFT RIGHT LOGICAL BINARY                        srl           SPC
SHIFT RIGHT LOGICAL BINARY IMMEDIATE              srli          SPC
SHIFT RIGHT LOGICAL BINARY REGISTER               srlr          SPC
SHIFT RIGHT LOGICAL HEXADECIMAL                   srlh          SPC
SHIFT RIGHT LOGICAL HEXADECIMAL IMMEDIATE         srlhi         SPC
SHIFT RIGHT LOGICAL HEXADECIMAL REGISTER          srlhr         SPC


Table 5. Branch Instructions

NAME                                              MNEMONIC      TYPE
BRANCH                                            b             RS
  (WITH DELTA)                                    bwd           RS
BRANCH DIRECT                                     bda           DA
BRANCH IMMEDIATE                                  bi            RI
  (WITH DELTA)                                    biwd          RI
BRANCH REGISTER                                   br            RS
BRANCH AND LINK                                   bal           RS
BRANCH AND LINK DIRECT                            balda         DA
BRANCH AND LINK IMMEDIATE                         bali          RI
  (WITH DELTA)                                    baliwd        RI
BRANCH AND LINK REGISTER                          balr          RS
BRANCH BACKWARD                                   [illegible]   RS
  (WITH DELTA)                                    [illegible]   RS
BRANCH BACKWARD DIRECT                            bbda          DA
BRANCH BACKWARD IMMEDIATE                         bbi           RI
  (WITH DELTA)                                    bbiwd         RI
BRANCH BACKWARD REGISTER                          bbr           RS
BRANCH FORWARD                                    bf            RS
  (WITH DELTA)                                    bfwd          RS
BRANCH FORWARD DIRECT                             bfda          DA
BRANCH FORWARD IMMEDIATE                          bfi           RI
  (WITH DELTA)                                    bfiwd         RI
BRANCH FORWARD REGISTER                           bfr           RS
BRANCH ON CONDITION                               bc            RS
  (WITH DELTA)                                    bcwd          RS
BRANCH ON CONDITION DIRECT                        bcda          RS
BRANCH ON CONDITION IMMEDIATE                     bci           RI
  (WITH DELTA)                                    bciwd         RI
BRANCH ON CONDITION REGISTER                      bcr           RS
BRANCH RELATIVE                                   brel          RI
  (WITH DELTA)                                    brelwd        RS
NULL OPERATION                                    noop          RR



Table 6. Status Switching Instructions

NAME                                              MNEMONIC      TYPE
RETURN                                            ret           SPC

Table 7. Input/Output Instructions

NAME                                              MNEMONIC      TYPE
IN                                                IN            SPC
OUT                                               OUT           SPC
INTERNAL DIOR/DIOW                                INTR          SPC



SOME SUMMARY FEATURES 

The APAP Machine in Perspective

What we have described in accordance with our invention could be thought of, in its more detailed aspects,
as positioned in the technology somewhere between the CM-1 and N-cube. Like our APAP, the CM-1 uses
a point design for the processing element and combines processing elements with memory on the basic
chip. The CM-1, however, uses a 1 bit wide serial processor while the APAP series will use a 16 bit wide
processor. The CM series of machines started with 4K bits of memory per processor and has grown to 8 or
16K bits, versus the 32K by 16 bits we have provided for the first APAP chip. The CM-1 and its follow-ons
are strictly SIMD machines while the CM-5 is a hybrid. Instead of this, our APAP will effectively use MIMD
operating modes in conjunction with SIMD modes when useful. While our parallel 16 bit wide PMEs might
be viewed as a step toward the N-cube, this step is not warranted. The APAP does not separate memory
and routing from the processing element as does the N-cube [illegible]. Also, the APAP provides for up
to 32K 16 bit PMEs while the N-cube only provides for 4K 32 bit processors.
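The size comparison in the paragraph above can be checked with a few lines of arithmetic. This is only a sketch: it assumes "K" means 1024, and the variable names are illustrative rather than taken from the specification.

```python
K = 1024  # ASSUMPTION: "K" in the text denotes 1024

# Per-processor memory, in bits, using the figures quoted in the text.
cm_initial_bits = 4 * K          # CM series starting point: 4K bits
apap_bits = 32 * K * 16          # first APAP chip: 32K words of 16 bits

# Aggregate data-path width of the largest configurations mentioned.
apap_total_width = 32 * K * 16   # up to 32K PMEs, 16 bits each
ncube_total_width = 4 * K * 32   # 4K processors, 32 bits each

print(apap_bits // cm_initial_bits)            # memory ratio per processor
print(apap_total_width // ncube_total_width)   # aggregate width ratio
```

On these figures each APAP PME has 128 times the memory of an original CM processor, and a full APAP array is four times as wide in aggregate as a 4K-processor N-cube.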

[illegible] CM and N-cube series [illegible] the APAP concept completely differs from the [illegible]:

1. The modified hypercube incorporated in our APAP is a new invention providing a significant packaging
   [illegible] hypercube topologies. For instance, [illegible] the
   32K PME APAP in its first preferred embodiment has a network diameter of 19 logical steps and, with
   transparency, this can be reduced to an effective 16 logical steps. Further, by comparison, if a pure
   [illegible] every 8 PMEs would be active while the remainder would be delayed due to blockage.

   Alternatively, consider the 64K hypercube that would be needed if CM-1 was a pure hypercube. In
   [illegible] 16 [illegible] could be routed between the two
   farthest separated PMEs in 15 logical steps. If all PMEs tried to transfer an average distance of 7 steps
   [illegible] 16 nodes on a chip with a NEWS network; then it provides one router function within the chip.
   The 4096 [illegible] are connected into a 12d hypercube. With no collisions the hybrid still has a
   logical diameter of 15, but since 16 PMEs could be contending for [illegible], the effective diameter
   is [illegible]. [illegible] 8 [illegible] 2 of 16 PMEs could be active, which means that 8 complete
   cycles rather than 4 cycles are needed to complete all data moves.

   The N-cube actually utilizes a pure hypercube, but currently only provides for 4096 PMEs and
   [illegible] 12d (13d for 8192 PMEs) [illegible]. For [illegible] to grow to [illegible] processors
   [illegible] the connection ports to each PME router by 25%. Although
   no hard data exists to support this conclusion, it would appear that the N-cube architecture runs out of
   connector pins prior to reaching a 16K PME machine.
2. The completely integrated and distributed nature of major tasks within the APAP machine is a decided
   [illegible] CM [illegible]
   [illegible] as separate units for floating point coprocessors. The APAP [illegible] combines
   [illegible] message routing [illegible] the single point design PME.
   [illegible] replicated 8 times on a chip, and the chip is then replicated 4K times to produce the
   array. This provides several advantages:

   a. Using one chip means maximum size production runs and minimal system factor costs.

   b. Regular architecture produces the most effective programming systems.

   c. Almost all chip pins can be dedicated to the generic problem of interprocessor communication,
      [illegible] the inter-chip I/O bandwidth which tends to be an important limiting factor in MPP designs.

      [illegible] design [illegible]
      investment in custom chip designs.

[illegible] performance. It is [illegible] that APAP PME performance on
[illegible] 125 cycles per [illegible]. In [illegible] '387 coprocessor would be about [illegible] cycles
[illegible] only one floating point unit for every 16 PMEs, while in the N-cube case there is probably
[illegible] chip associated with each of the '386 processors. Our APAP has 16 times as many [illegible]
can almost completely make up for the single unit performance delta.

[illegible] 8 APAP PMEs within a chip are constructed from 50K gates currently
available in the technology. As memory macros shrink and the number of gates available to the logic
increases, spending that increase on enhanced floating point normalization should permit APAP floating
point performance to far exceed the other units. Alternatively, effort could be spent to generate a PME
[illegible] custom design [illegible], enhancing total performance while in no way
affecting any S/W developed for the machine.

[illegible] our APAP has characteristics poised to take advantage of the future
process technology growth. In contrast, the nearest similar machines, CM-x and N-cube, which employ a
[illegible] to use DASD associated with groups of PMEs. [illegible]
APAP [illegible], as well as the ability to connect displays and auxiliary storage, is a byproduct of
[illegible] ports of the PME array. Thus, APAP systems are
configurable and can include card mounted hard drives selected from one of the set of units that are
compatible with PS/2 or RISC/6000 units. Further, that capability should be available without designing any
additional part number modules, although it does require utilizing more replications of the backpanel and
base enclosure than does the APAP.
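Two of the sizing figures quoted in the comparison above can be reproduced with simple arithmetic. The sketch below relies on the standard property that a pure binary hypercube of 2^d nodes has dimension d, with d links per node. The helper name `hypercube_dimension` is hypothetical, and "4K" is assumed to mean 4096.

```python
def hypercube_dimension(nodes: int) -> int:
    """Return d for a pure binary hypercube with 2**d nodes."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("node count must be a power of two")
    return nodes.bit_length() - 1

# Array size from the replication figures in the text:
# one PME design, 8 per chip, and the chip replicated 4K times.
PMES_PER_CHIP = 8
CHIPS_PER_ARRAY = 4 * 1024          # ASSUMPTION: "4K" = 4096
total_pmes = PMES_PER_CHIP * CHIPS_PER_ARRAY

for nodes in (4096, 8192, 65536):
    d = hypercube_dimension(nodes)
    print(f"{nodes:6d} nodes -> {d}d hypercube, {d} links per node")
print("APAP array size:", total_pmes)
```

This matches the 12d and 13d figures given for 4096 and 8192 PMEs, and the 32K PME array size. Because the link count per node grows with the dimension, it also shows why a pure hypercube needs more connector pins as it scales.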

This brief perspective is not intended to be limiting, but rather is intended to cause those skilled in the
art to review the foregoing description and examine how the many inventions we have described may
be used to move the art of massively parallel systems ahead to a time when programming is no longer a
significant problem and the costs of such systems are much lower. Our kind of system can be made
available not only to the few, but to many, as it could be made at a cost within the reach of commercial
department level procurements.

While we have described our preferred embodiments of our invention, it will be understood that those
skilled in the art, both now and in the future, upon the understanding of these discussions will make various
improvements and enhancements thereto which fall within the scope of the claims which follow. These
claims should be construed to maintain the proper protection for the invention first disclosed.

Claims

1. A computer system comprising: a plurality of multi-processor memory elements, each having
   communication paths, processor and memory, and wherein a programmable router is provided for routing data
   and control information from one multi-processor memory element to another multi-processor memory
   element and between nodes of the computer system.

2. A computer system according to claim 1 wherein each multi-processor memory element (PME) has 2n
   processors, and communication paths which minimize delays due to chip crossings.

3. A computer system according to claim 1 wherein each multi-processor memory element (PME) has a
   processor, memory and routers within a single chip and internal and external communication paths
   which minimize delays due to chip crossings, each processor memory element having means for fixed
   and floating point processing, routing and I/O control.

4. A computer system according to claim 1 further comprising within a processor memory element
   a native instruction set means for providing an expandable multiply function, a programmable router for
   routing information alternatively left/right, NEWS matrix, NEWS/up-down, hypercube, and wherein said
   programmable router employs a hardwired distributed router provided by each processing memory element.

5. A computer system according to claim 1 organized as a massively parallel machine with
   nodes interconnected as an n-dimensional network cluster with parallel communication paths between
   processor memory elements along said internal and external communication paths providing a processing
   array, and wherein processing memory elements of an array have a transparent mode utilized when
   routing data between processing memory elements within a chip set of processing memory elements
   for permitting reduction of the effective network diameter of a network of nodes.

6. A computer system according to claim 1 wherein a node of a processor array has multiple single
   processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight
   processors with their associated memory with their fully distributed I/O routers and signal I/O ports.

7. A computer system according to claim 1 wherein a node of a processor array has multiple single
   processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight
   processors with their associated memory with their fully distributed I/O routers and signal I/O ports
   combined as groups of node clusters organized as a 2d modified hypercube.

8. A computer system according to claim 1 wherein a node of a processor array has multiple single
   processor elements made up of 32K 16-bit words with a 16-bit processor for a network node of eight
   processors with their associated memory with their fully distributed I/O routers and signal I/O ports
   combined as groups of node clusters organized as a 2d modified hypercube, with up to 64
   clusters integrated in a network of node clusters to form a [illegible]d modified
   hypercube of up to 32,768 processing memory elements.
9. A computer system according to claim 1 wherein a node processing memory element has internal data
   flows using high speed hard registers to feed distributed ALU and I/O router registers and logic for
   [illegible].

10. A computer system according to claim 1 wherein a node processing memory element has an I/O
    port for off chip byte wide communication, and has input ports that are connected such that data may
    be routed from input to memory, or from an input address register to an output register via a direct
    parallel data path.

11. A computer system according to claim 1 wherein a node has multiple processor memory elements and
    is connected to other nodes in a cluster network with data routing distributed between hardware and
    software, with software controlling most of the task sequencing function.

12. A computer system according to claim 1 wherein a node has multiple processor memory elements and
    is connected to other nodes in a cluster network with data routing distributed between hardware and
    [illegible] provided [illegible] and minimizing overhead on the
    [illegible].
13. A computer system according to claim 1 wherein a node has multiple processor memory elements and
    is connected to other nodes in a cluster network with I/O programs at dedicated interrupt levels for
    managing the network.

14. A computer system according to claim 1 wherein a node has multiple processor memory elements and
    is connected to other nodes in a cluster network with I/O programs at dedicated interrupt levels for
    managing the network, each processor memory element having interrupt registers and dedicating four
    [illegible] from four [illegible], a buffer provided at each level by loading
    registers at the level, and having IN and return instruction pairs using a buffer address and transfer
    count to enable the processor memory element to accept words from an input bus and to store them to
    the buffer.

15. A multi-processor memory system, comprising: a plurality of multi-processor memory elements, each
    multi-processor memory element (PME) having 2n processors, memory and routers within a chip
    and internal and external communication paths which minimize delays due to chip crossings.
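For readers who want a concrete picture of the buffered receive recited in claim 14, the following Python sketch models a per-level buffer address and transfer count. It is an illustrative model only. The class names, the post-increment of the address, the 16-bit masking, and the completion signal are all assumptions, not details taken from the specification.

```python
from dataclasses import dataclass

@dataclass
class ReceiveLevel:
    """Registers assumed to be loaded at one interrupt level."""
    buffer_address: int   # next memory word to fill
    transfer_count: int   # words still expected

class PmeModel:
    """Hypothetical model of one PME's memory and interrupt-level registers."""

    def __init__(self, memory_words: int = 32 * 1024):
        self.memory = [0] * memory_words
        self.levels = {}

    def load_level(self, level: int, buffer_address: int, transfer_count: int) -> None:
        # "A buffer provided at each level by loading registers at the level."
        self.levels[level] = ReceiveLevel(buffer_address, transfer_count)

    def receive_word(self, level: int, word: int) -> bool:
        """Model one IN/return pair; return True when the transfer is complete."""
        state = self.levels[level]
        if state.transfer_count == 0:
            raise RuntimeError("no transfer pending at this level")
        self.memory[state.buffer_address] = word & 0xFFFF  # 16-bit word
        state.buffer_address += 1   # ASSUMED post-increment of the address
        state.transfer_count -= 1
        return state.transfer_count == 0
```

In this model each arriving word costs one call, and the caller learns that the buffer is full when the transfer count reaches zero.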